Studies on the migration and isolation of selected ancient populations of India:
a Non Recombinant Y chromosome (NRY) study
The Deccan and the Dravidian regions

1. INTRODUCTION

The Indian subcontinent is the southern extension of mainland Asia, formed mostly
by the Indian Plate protruding into the Indian Ocean. It comprises present-day India,
Pakistan and Bangladesh. This peninsular region covers a vast area of 4.4 million km²
and is delineated by mountain ranges: the Himalayas in the north, the Hindu Kush in
the west and the Arakanese in the east. It is bounded by the Indian Ocean on the south,
the Arabian Sea on the west and the Bay of Bengal on the east (Fig 1). In addition to
these barriers, the perennial rivers of India that originate in the Himalayas, such as
the Indus, Ganges and Brahmaputra, restricted population movements within the
subcontinent. This left the coastline as a highway for easy movement in early times,
for animals and humans alike. The first coastal migration of modern humans from
Africa to Australia passed along the coasts of India (Wells et al., 2001; Wells and
Read, 2002; Henn et al., 2012).

The arrival of modern humans (Homo sapiens sapiens) in India has been traced back
to 75,000 Ybp, whereas earlier hominids, including Homo erectus, date to 500,000 Ybp
(Bongard-Levin, 1979). Isolated remains of Homo erectus at Hathnora in the Narmada
Valley in Central India suggest occupation since the Middle Pleistocene (between
500,000 and 200,000 Ybp). The archaeological site in the Soan River valley in the
Sivalik region contains Palaeolithic hominid remains (Rendell, 1989). In spite of
serious efforts, remains of modern humans (Homo sapiens sapiens) are yet to be
discovered in India, though the caves of Sri Lanka (Pahiyangala in Bulathsinhala)
have yielded the oldest complete anatomically modern human skeleton within South
Asia. Earlier, two crania were discovered there and dated to 37,000 Ybp
(Deraniyagala and Siran, 1996). The antiquity of habitation in South Asia is thus
well established: animal and human habitations are known from time immemorial.

Figure 1: Topographic map of India

The present-day Republic of India is the seventh-largest country by area and the
second-most populous, with over 1.2 billion people living in 28 states and 7 Union
Territories. These states are very diverse in culture and language. In fact, the
states were carved out on a linguistic basis in post-independence India in 1956, on
the grounds that people speaking a language share a culture (The States
Reorganisation Act, 1956). India's diverse habitats host many ancient fauna and
flora, and many of them are today protected as wildlife habitats of mega
biodiversity. Of these, the Western and Eastern Ghats, the Vindhyas and the ranges
of the Northeast possessed ideal climatic conditions and altitude for biodiversity
to evolve and settle, making this terrain a haven for most of the tribes of India.
At present, the people of India are broadly divided into castes and tribes. The
tribes constitute 8% of the total population of the country. They are represented
by 745 scheduled tribes, numbering 84.51 million (2001 Census), and occupy 15% of
the country's area. Whether these tribes were the ancient settlers of, or are
indigenous to, India remains a major open question.

The question of the ancient settlers and the autochthonous origin and evolution of
people in India has been an area of interest to many scientists, linguists and
archaeologists. If one argues that tribes represent the earliest settlers in this
land, this raises further questions: were all the tribes across India the same in
terms of their origin and evolution? Could the castes and tribes have been derived
from the same initial settlers? The issue is compounded when we consider the whole
of India, particularly Northern and Northwestern India, because the historical
invasions tend to mask the social fabric. The famous Indus valley urban civilization
(3300-1300 BCE; mature period 2600-1900 BCE) was a gateway to modern India. It
extended from the east of the Ghaggar-Hakra River valley (Possehl, 1990) to the
upper reaches of the Ganges-Yamuna Doab (Leshnik and Junghans, 1968), and from the
Makran coast of Balochistan in the west, via northern and northeastern Afghanistan,
to Daimabad in Maharashtra in the south. The civilization was spread over some
1,260,000 km², making it the largest ancient civilization; it subsisted on barley
and wheat (Weber, 1991). The subsequent developments in the Indo-Gangetic Doab and
the various migrations and invasions also need to be taken into account in
investigating the genetic history of India. In addition, given the geography of
India described earlier in this introduction, the many ancient and historic
migrations that occurred through the passes of the Hindu Kush ranges, by the sea
route along the western coast and, to a lesser extent, across the northeastern
ranges need to be considered as confounders.

Modern science employing DNA technologies has attempted to unravel the mysteries of
the origin of Indian castes and tribes alike. The caste system in India is a unique
social institution characterized by strict inbreeding and endogamy (Trautmann,
1981). It is generally believed that settled agriculture led to sedentary life, land
holding and patrilineal inheritance, as confirmed by an NRY study (Chaix et al.,
2007). In Tamil Nadu, until the early Chola period the lands belonged to the state,
and only with the advent of wetland irrigation were the practices of land holding
and male hegemony established (Sastri, 1975). A more recent study of 1,680 Y
chromosomes representing 12 tribal and 19 non-tribal (caste) endogamous populations
from Tamil Nadu suggested that population differentiation, particularly of the male
lineages of this region, correlated with agricultural expansions predating the
varna system (ArunKumar et al., 2012), with the mode of subsistence a major factor
in determining the structure of the populations. These populations share a genetic
heritage dating back to the late Pleistocene (10-30 Kya). The coalescence analysis
suggested that social stratification was established as early as 4-6 Kya, with
little admixture during the last 3 Kya. The study also showed that this genetic
structure was not influenced by the later-introduced varna system, as documented by
the Brahmin migrations into the area. The overall Y-chromosomal patterns,
correlating with the time depth of population diversification and the period of
differentiation, were best explained by the emergence of agricultural technology in
South Asia as described by Fuller (2007).

Many of the genomic studies on Indian populations were carried out on available
samples, with no in-depth consideration given to the demographic profile and
cultural anthropology of the populations studied or to sample collection. Further,
many of these studies used different strategies for grouping populations and
interpreting results, and often considered very small sample sizes. For example, one
study treated Brahmins as a single entity and classified them as Dravidian speakers
on the basis of the language they speak today (Sengupta et al., 2006); another
considered the Chenchu a tribe and, because they clustered with Brahmins on M17 in
the tree, interpreted the two as having a common origin (Kivisild et al., 2003). An
earlier symposium on the peopling of India convened by Balasubramanian and Rao
(1998) offered some glimpses into the subject, the issues at hand and their
importance in studying the ‘peopling of India’. A problem common to these studies
was that they investigated piecemeal and tried to draw overarching conclusions. The
geophysical barriers, subsistence patterns, geographical expanse, languages and
cultural characteristics all play key roles in population genetic phenomena such as
founder effect, isolation, expansion and dispersal, and all of these need to be
considered holistically. This was the aim of The Genographic Project, a global
non-profit, non-patenting research project funded by National Geographic, IBM and
The Ted Waitt Family Foundation, USA, in which I am part of the India investigator
team and which led to this thesis.

The story of mankind is one of migration. Most of the studies employing uniparental
markers support the coastal route. Earlier work from this laboratory investigated
the Y chromosomes of Tamil Nadu and Kerala (Kavitha, 2008). Similarly, Y chromosome
and mtDNA analyses were carried out on populations of Orissa and the North-East of
India (ArunKumar, 2012). These studies have given a clear picture of the NRY HG and
STR profiles of these regions and their correlation with subsistence and language.
It was thus of interest to study the other states of the Deccan (Karnataka and
Andhra Pradesh) and the states on the ‘coastal highway’ to India, Maharashtra and
Gujarat. Studying the people of the Deccan and Southwestern India will throw light
on the question of the early entry of people by the coastal route, and on their
settlement and expansion. The specific questions asked were:

i. What are the parameters that determine the NRY genetic structuring
of the ancient populations of Gujarat, Maharashtra, Karnataka and
Andhra Pradesh?

ii. Did the tribes and castes, or the various tribes and various castes
among them, of these regions have a common origin?

iii. Can we explain Dravidian by studying the populations of the Deccan?
Can its gene pool or language be equated to any of the NRY lineages, as
suggested by Sengupta et al. (2006)?

iv. What is the distribution pattern of the HG J clades that are sporadically
seen in Indian populations? Does it correlate with any technological
advancement or culture?

v. What is the mechanism of caste formation in India? Can it be
answered by studying in depth a well-defined caste such as the
Nattukottai Chettiar?

Many surprises were in store: the study showed that the population structure of the
various states was determined by different factors, and no single model can be
proposed to explain the peopling of the study states. Each state, geographic region
and group of language speakers thus needs to be considered individually, as a single
entity, to understand the various population genetic mechanisms and confounders
that led to the present scenario. Unraveling these parameters is essential to
deconstruct the history of contemporary populations. This thesis has attempted to
study the above-mentioned aspects.

2. REVIEW OF LITERATURE


Human genetics is the study of the inheritance of human variation. From the days of
Gregor Mendel until now, with the advent of the genomic era and the outcome of the
Human Genome Project, the identifiable markers have shifted from pea pods and pod
colours to SNPs and STRs (Lander et al., 2001). Genome sequences and their variation
in populations have given ample scope to study genes and loci of medical and
evolutionary significance. From an evolutionary perspective, genes and genomes have
helped in understanding the origin and dispersal of various populations. Human
population geneticists, anthropologists and archaeologists seized the opportunity to
employ these modern tools to decipher how human populations in various continents,
countries and regions of the world have diverged from one another, and how, where
and when our species originated. The genomic era was thus ushered in by many great
discoveries and the development of various technologies, such as:

a. Discovery of the polymerase chain reaction (PCR) (Mullis, 1990)

b. Automated sequencing using fluorophores (Olsvik et al., 1993)

c. Cataloguing of the whole human genome sequence as an outcome of the Human
Genome Project and its availability in the public domain (Lander et al., 2001).

d. Discovery of many evolutionarily significant uniparental markers that are
unidirectional, such as mtDNA and NRY (Cann et al., 1987; Underhill et al.,
1997).

All these made it possible to gain a more exact insight into human genetics in
various contexts.

In the present study, I have attempted to study the migration and isolation patterns
of selected Indian populations using the Y chromosome as a tool. I have selected
Gujarat and the Deccan states (Maharashtra, Karnataka and Andhra Pradesh) for this
investigation. To better understand the conceptual framework of the subject and to
define the origin, expansion and migration of populations in the study regions, it
is essential to understand the terrain, archaeology, culture and languages of the
study region. I present a concise review of these, followed by the subject matter.


2.1 The Land and its People: 


The Indian Peninsula is located at the junction of three continents, viz. Africa,
Europe and Asia. India therefore played a crucial role in the housing and dispersal
of early modern humans, which enhanced its cultural, linguistic and genetic
diversity. Addressing the genetic diversity of contemporary Indian populations and
relating these patterns to the cultural, linguistic and demographic histories of the
people is of great importance (Majumder, 1998). The evidence for early human
habitation in India from archaeological, linguistic and genetic studies is given
below:


2.1.1 Archaeological time scale of Indian people: 


2.1.1.1 Late Pleistocene and first cultures in India: 


The Late Pleistocene (ca 250,000-10,000 years ago) played a major role in the
history of South Asia. The earliest known culture is attested by stone tools found
on the banks of the Sohan River in the Siwalik Hills and at Rawalpindi in Pakistan
(Terra and Paterson, 1939), and is named the Soanian culture. It was followed by the
Acheulian culture, which extended from north of the Siwaliks to Madras (Misra,
1987). Tools of the Acheulian culture in Maharashtra and Karnataka have been dated
to 350,000 Ybp (Mishra, 1992). Animal hunting was the main occupation; chopping
tools, cleavers, scrapers, blades and cores were used for hunting.


2.1.1.2 Middle and Upper Palaeolithic age in Deccan India: 


The Middle Palaeolithic age (ca 20,000-42,000 Ybp) corresponds to the period of the
Neanderthal remains in Europe, but no such evidence has been found in India.
However, stone tools have been found in the Narmada river basin (Khatri, 1962), the
Chota Nagpur Plateau (Ghosh, 1970), the Deccan plateau (Sankalia, 1956) and the
Eastern Ghats (Murty, 1966). Traps, nets and snares were probably used during this
period. The Upper Palaeolithic (ca 32,000-14,800 Ybp) tools used by the people of
Central India and the Eastern Ghats include bored stones that resemble the net
sinkers used today by the Yanadi tribe (Andhra Pradesh) and on the nets of the Voda
Balija (Andhra Pradesh) and other fishing communities. Food procurement during the
Upper Palaeolithic was therefore probably based on aquatic systems. The current
study includes populations from the Deccan plateau and the Eastern Ghats.
2.1.1.3 Mesolithic and technology advancement: 

The Mesolithic age in India is marked by microlithic tools, the making of gums and
the use of the bow and arrow (Wakankar and Brooks, 1976). Bifacial points made by
pressure flaking are a characteristic feature of the Mesolithic industries of the
coastal dunes of southern Tamil Nadu (Zeuner and Allchin, 1956) and Sri Lanka. The
first colonisation of the Ganga plains started in this period (Sharma et al., 1980).
The nomadic lifestyle gave way to a seasonally sedentary life. The earliest
evidence for disposal of the dead in burial grounds, in extended and crouched
positions, comes from this period. Animals such as the dog, sheep, goat and cattle
were domesticated. The first cultivated plants were wheat and barley. Rice
cultivation and pig domestication reached the Middle Ganga via China. The Jurreru
valley in southern India shows technological advances in the baking of microblades
to give them form (Petraglia et al., 2007).


2.1.1.4 Neolithic and the Indus Valley civilization: 


The Neolithic age in India is defined by agriculture (8000-7000 BC). Mehrgarh is the
oldest known agricultural settlement in the Indian subcontinent (Jarrige, 1986). It
is located on the banks of the river Bolan (a tributary of the Indus). The history of
this site can be divided into 8 periods (Willey et al., 2001). Periods I-V are marked
by polished tools and long-distance trade for beads and pottery. Terracotta human
figurines appear in this era. Cotton seeds and a new breed of barley have been
identified. Use of timber is also seen at the Neolithic sites, and polychrome
pottery developed. Period VI shows pipal leaf and humped bull designs, and the
worship of a female deity and Shiva. Swastika symbols have been identified. This
culture migrated to Nausharo in the third millennium BC (Jarrige, 1990).

The population explosion in Baluchistan forced people to move into other regions of
the Indus valley and along the now-dry Ghaggar-Hakra River in the fourth millennium
BC. They also spread to Gujarat, northwest Rajasthan, Punjab, Haryana, western UP,
Pakistan and southern Afghanistan. Some scholars believe that the Ghaggar-Hakra
River was the sacred Saraswati River eulogised in the Rigveda. A change in the
course of the Yamuna and Sutlej rivers could have been the reason for the drying of
the Saraswati River, which eventually led to the abandonment of the Indus sites for
other areas. Other explanations for the abandonment of these sites include a
reduction in rainfall, foreign invasions and environmental degradation due to
excessive use of soil and plant resources (Saraswat, 1993). Populations that have an
oral history of migration from the Saraswati River basin have been sampled for this
study.

The agricultural crops of the Harappans were mainly wheat and barley in the Indus
region and millets (bajra, ragi, little millet, Italian millet) in Gujarat. Rice was
added as they came into contact with the civilization of the Ganga plains (Weber,
1991) by ca 2500 BC (Fuller, 2007). Coastal trade was established with the Persian
Gulf countries. Social and economic stratification appears in Harappan society, with
ruling classes, farmers, artisans, traders, priests and workers. The dead were
buried. The interpretation of the Indus script remains ambiguous to date. The script
consists of pictographic signs on seals and tablets, written right to left and in
some cases boustrophedonically (Parpola, 1994; Possehl, 1996). Parpola (1994)
suggested that the Harappan inscriptions are mainly Dravidian. Elamite (of the
proposed Proto-Elamo-Dravidian language macro family) and Dravidian have been shown
to be highly related in terms of their vocabulary (McAlpin, 1981).


2.1.1.5 Neolithic farming and traditions in North and North east India: 


Farming communities also emerged outside the Indus and Harappan sites. The Neolithic
sites were restricted to the Kashmir valley, the northern Vindhyas, the Ganga
valley, south India, and east and northeast India. The tools developed in the
Kashmir valley were unique, being found otherwise only in the Neolithic sites of
north China. The main crops were winter wheat, barley, lentil and peas (Kajale,
1991; Lone et al., 1993), which were probably derived from the Near East. Animals
such as cattle, goat and sheep were domesticated. Hand-made and wheel-made pottery
developed, similar to that of Pakistan.

The Neolithic of the northern Vindhyas and the Ganga valley was crucial, as the
region was a meeting ground for Indo-European (IE), Austro-Asiatic (AA) and
Dravidian (DR) speakers (Misra, 2001). The communities in the Vindhyas mainly
practised shifting or plough cultivation. The Ganga plains, with their alluvial
soil, supported agriculture by 3000 BC (Costantini, 1987; Misra, 2001). Hunting
groups assimilated into the agriculture-based society. Rice, millets, monsoon pulses
and winter crops were grown. At the end of the third millennium or in the second
millennium BC, east India grew wild varieties of rice, millets, pigeon pea, lentils
and pulses (Fuller, 2007). Artefacts related to fishing are also seen. Northeast
India cultivated yams and taros. The AA speakers raised wooden memorials for the
dead.
2.1.1.6 Neolithic and farming in Deccan: 

In Saurashtra, agriculture was dominated by millets native to India. By ca 2000-1700
BC, crops from Africa such as sorghum, pearl millet and finger millet were
introduced. Pulses and legumes were also chief crops. The static frontier, in which
agricultural groups interact with hunter-gatherer groups for trade, is best inferred
for Gujarat and Rajasthan (Fuller, 2006; Fuller, 2007).

Neolithic sites in south India are found in north Karnataka, west Andhra Pradesh and
north Tamil Nadu. Many of them occur on the flat tops, slopes and foot of granitic
hills, but some are also found on the alluvial banks of rivers such as the Godavari,
Krishna, Penneru, Tungabhadra and Kaveri (Paddayya, 1973; Murty, 1989). In the
south, the Neolithic age is marked by ash mounds. The economy was chiefly
agro-pastoral. A few non-ash-mound sites have been identified in south Karnataka
(Fuller, 2006). Ethnohistorical data suggest that animal herding, the chief
occupation of populations such as the Gollas of Andhra Pradesh, the Kurubas of
Karnataka and the Dhangars of Maharashtra, started during this age (Murty, 1989).
These populations have been included in this study. The earliest evidence of writing
in South India, in Tamil Sangam literature, dates to the third century BC. Burial
was the means of disposing of the dead.
2.1.1.7 Chalcolithic age in Deccan: 

In India, the Chalcolithic evolved in parallel with the Neolithic age. The discovery
of copper and bronze led to improvements in tools, weapons, ornaments, architecture
and pottery. Chalcolithic cultures were found in west and central India, Rajasthan,
Malwa, the Vindhyas and the Ganga valley. The Northern Deccan or Western
Maharashtra, particularly the semi-arid region to the east of the Sahyadris, has
provided the best evidence of the Chalcolithic cultures in India (Dhavalikar, 1997).
The Malwa culture (movement of the Malwa people from Central India with agriculture)
and the Jorwe culture (agricultural colonisation) are characteristic features of
Chalcolithic Maharashtra. Buddhism and Jainism prevailed during this period. The
first Indian empire, Magadha, arose. With the introduction of iron, cultural
development shifted to south India. Black and red ware pottery and painted grey ware
(PGW) are the main features of the Iron Age (Agrawala, 1989). PGW was first found at
Ahicchatra in Bareilly district of Uttar Pradesh (Ghosh and Panigrahi, 1946).
Populations with an oral history of migrating from this region have been sampled for
the study. In the first millennium BC, the Megalithic culture developed.
2.2 Origin/Spread of People and languages in India, evidence from linguistic studies:

In India there are four main language families with strong regional affiliations:
IE, DR, AA and Tibeto-Burman (TB). The present study includes IE-, DR- and a few
AA-speaking populations, to test the correlation of language with NRY genes and the
migratory patterns of my study populations. The hypotheses made by other linguistic
and genetic studies are outlined below:

Two models have been proposed for the spread of these languages: the “wave model”
(Ammerman, 1984) and the “elite dominance” model (Renfrew, 1988). According to the
wave model, agricultural surplus leads to an increase in population density relative
to the hunter-gatherer community, and this carries the languages and genes into
other areas. In the ‘elite dominance’ model, the language of a small invading group
is adopted by a large resident population. This language shift occurs either by
force or owing to its social advantages.

2.2.1 Indo-European languages:


There are several theories on the spread of Indo-European languages into India. The
Anatolian theory claims that IE languages spread from Anatolia with agriculture
~8000-9500 years ago (Gray and Atkinson, 2003; Bouckaert et al., 2012). The Kurgan
theory holds that warriors from north of the Black Sea invaded Europe between 4300
and 2800 BC and imposed their language on Europeans (Mallory, 1989). Astronomical
references in Vedic literature have been taken to show the presence of IE speakers
in the fourth millennium BC or earlier, which would make India another probable
homeland of IE speakers. The spread of IE speakers into south India has been
associated with settled agriculture and irrigation technologies (Sastri, 1975).

Y chromosome studies state that the caste populations in India are mainly derived
from Indo-European speakers who migrated from Central Asia ~3.5 kya (Cordaux, 2004).
The presence of Y HG R-M17 at a frequency of 40% in caste populations of India and
Central Asia, against a relatively low frequency (9%) in tribal populations,
supports this view. In contrast, Sharma et al. (2009) suggested an autochthonous
origin of this HG.


2.2.2 Dravidian languages: 


Robert Caldwell was the first to use the word ‘Dravidian’, in 1856, and to propose
an autochthonous origin of the Dravidian languages, contrary to the then widely held
view that they originated from Sanskrit. He also suggested an affinity of the
Dravidian languages to Scythian languages. Different schools of thought exist on the
origin and spread of the Dravidian languages:

(i) Speakers of Proto-Elamo-Dravidian from Elam carried their language and
agricultural technologies from the Zagros Mountains in southwestern Iran to India
(Menozzi et al., 1994; Renfrew, 1996). This theory also proposes that Central Asian
pastoral nomads moved with the IE language into Iran, Pakistan and north India ~4000
Ybp by an elite dominance process (Renfrew, 1988).

(ii) Based on lexical reconstructions of flora, place names, modern language
geography and archaeological evidence, Parpola (1994) provided evidence for a
proto-Dravidian origin in the Indus region. Brahui, a Dravidian-speaking isolate in
present-day Pakistan, is thought to have migrated from the north Dravidian region in
Central India within the past millennium, as evidenced by its vocabulary (Elfenbein,
1987; Fuller, 2007). Alternatively, it is hypothesised that Brahui is the remnant of
a once widespread Dravidian language that was eventually replaced by the influx of
IE speakers into India.

(iii) Scientists working with genomic tools have proposed an autochthonous origin of
Dravidian speakers in Southern India (Sengupta et al., 2006).

2.2.3 Austro-Asiatic languages:

The AA languages are divided into two main branches, Munda and Mon-Khmer.
Archaeological evidence, such as rice domestication, and linguistic evidence support
a Southeast Asian origin (“Britannica Online Encyclopedia,” 2012; Diamond and
Bellwood, 2003). Linguistic studies by Witzel (2005) point to east India as the
birthplace of AA speakers. Genetic studies by Basu et al. (2003) and Majumder (2001)
indicate that AA speakers had an indigenous origin in India.


2.2.4 Tibeto-Burman languages:


The TB-speaking populations mainly occupy the northeast region and the Himalayas of
India. Archaeological records state that these populations arose 5000-6000 years ago
(Guha, 1936). This age estimate is consistent with the Y STR data of Su et al. (2000)
from the Yellow River basin, China. TB speakers probably entered India through
multiple routes along the Himalayas, carrying YAP lineages into India (Sahoo et al.,
2006). TB speakers migrated with the female population from Burma, bringing the
Naga-Kuki-Chin languages and Y HG O3e into the subcontinent.
2.3 Castes and tribes of India 

The “Caste System” in India is a unique phenomenon characterised by inbreeding and
endogamy. Its origins remain highly controversial to date. Ethnographic and genetic
evidence both indicate that castes in India have been highly endogamous for a
considerable length of time (Karve, 1968; Bhasin and Shampa, 1994). Castes are
mainly associated with agriculture and are concentrated in the alluvial and coastal
plains of the country, whereas the tribes, who constitute 8% of the total Indian
population, mostly occupy the hilly and forested tracts (Bhasin, 2006).

Kivisild et al. (2003) reported that tribes and castes share considerable
Pleistocene heritage, with limited recent gene flow between them, whereas Cordaux
(2004) observed that castes and tribes may have independent origins.

However, the issues concerning the antiquity and past genetic history of the tribal
populations and the confounding influences of region, language and ethnicity have
remained elusive (Krithika et al., 2009). Their study further addressed the issue by
proposing different models:

a. The derived tribes had retained their common population name and language
from the early settlers.

b. The derived sub-tribes had retained a common ancestry but acquired different
languages.

c. Sub-populations derived from (two) different ancestries had retained their
separate ethnicity but adopted a common language.

This raises the possibility that different tribal or caste populations may have the
same or different origins. In-depth analysis of ethnographic details and language
shifts, along with the genetic data, is needed to decipher their origins and
antiquity.

2.4 NRY chromosome as a tool to study Population histories: 

Several biological markers, such as blood protein polymorphisms, HLA, mitochondrial
DNA and the Y chromosome, have been used to infer population histories. Of these,
the Non Recombinant Y (NRY) chromosome is the best suited for deciphering the
migratory patterns of male lineages, for various reasons.

The Y chromosome is one of the smallest chromosomes in the human genome (~60 Mb)
and represents 2-3% of a haploid genome; it evolved from a pair of autosomes around
300 million years ago (Lahn et al., 2001). The Y chromosome determines sex in humans
and maintains the male germ cells. About 95% of the Y chromosome is non-recombining
(NRY). It contains several repetitive DNA sequences, organised as palindromes in
which two very long, similar sequences point in opposite directions and are joined
by a spacer (Charlesworth, 2003).


2.4.1 Structure of Y chromosome: 


Chromosome banding techniques have revealed three main regions on the Y 
chromosome namely the pseudoautosomal (PAR) portion, euchromatin and 
heterochromatin regions. The PAR is divided into two i.e., PARI and PAR2. Fig 2 
gives the schematic representation of Y chromosome. PAR 1| is located at the terminal 
region of short arm (Yp) whereas PAR2 is located at the tip of long arm (Yq) which 
spans approximately 2660 to 320kb of DNA respectively. PARI exchanges its genetic 
material with the X chromosome during meiosis. Deletion of PAR | results in male 


sterility and failure of pairing during meiosis (Helena Mangs and Morris, 2007). 
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Figure 2: Schematic representation of Y chromosome showing the position of 
YSTR markers employed in this study 

Distal to the PAR is the euchromatic region, which spans 23 Mb of the male specific
region of the Y (MSY) (Skaletsky et al., 2003). It has three main classes of
sequence: the X transposed region, the X-degenerate region and the Amplicon region.
The X transposed region is mainly populated by Alu, retroviral and Long Interspersed
Element 1 (LINE 1) sequences. Alu markers find application in population genetic
studies. The X-degenerate and Amplicon regions are responsible for maintaining the
normal biological functions of the Y chromosome.

The distal Yq region is heterochromatic and contains numerous highly
repetitive DNA sequences, as well as genes responsible for the biological
functioning of the Y chromosome. It also houses several genetic markers used in
population genetic studies (de Carvalho and Santos, 2005), including the highly
repetitive sequences DYZ1 and DYZ2.

2.4.2 Mutations of interest on Y chromosome: 

There are two types of mutations in the non coding regions of the Y chromosome
that accumulate over time and are stable: biallelic polymorphisms, also called
unique event polymorphisms (UEPs), and Short Tandem Repeats (STRs) (Jobling and
Tyler-Smith, 2003).

The biallelic polymorphisms are slowly mutating markers and comprise single
nucleotide polymorphisms (SNPs), insertions/deletions (indels) and Long Interspersed
Element (LINE) insertions. The first biallelic marker to be identified was an Alu
insertion (YAP) that is present in the majority of African populations but absent in
European populations. A Y haplogroup can be defined as all the male descendants of
the single person who first carried a particular SNP mutation. Haplogroups
characterise the migration of population groups.

The long arm of the Y chromosome contains large interspersed tandemly
repeated arrays called Y STRs, also known as microsatellites. Mutations in
microsatellites occur through polymerase slippage during DNA replication (Lodish,
2000). The chromosomal positions of the STRs included in this study are shown in
Fig. 2.

DYS19 was the first polymorphic Y marker to be identified. A core set of YSTR
markers, referred to as the minimal haplotype, includes DYS19, DYS389I/II, DYS390,
DYS391, DYS392, DYS393 and DYS385 a/b (Butler, 2001; Kayser et al., 1997;
Roewer et al., 2001). Three models have been proposed for the origin of diversity in
Y STRs. The first is the “stepwise mutation model” (SMM), wherein each mutation
event involves the gain or loss of one repeat unit, resulting in expansion or
contraction respectively (Ota and Kimura, 1973). The product of a mutation is often
an already existing allele. This implies that the two alleles had a common ancestor.
It is the most preferred model for calculating genetic relatedness between
individuals or populations. However, this model has the drawback of homoplasy, the
phenomenon in which two alleles are identical in state but not identical by descent,
which leads to underestimation of divergence. The second is the “infinite allele
model” (IAM), which states that every mutation generates a new allele. A particular
locus is the same in two different individuals only if no mutation has occurred. The
probability that these two individuals had the same ancestor is exp(-2μt), where μ
is the assumed mutation rate and t is the time in generations. This model scores
each locus as a match or no match. Its disadvantage is that it could underestimate
the TMRCA because of the risk of undercounting the total number of mutations. The
third is the “K allele model”, which states that microsatellites can mutate randomly
to any of “K” alleles.
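
The first two models can be illustrated with a minimal Python sketch. It is not part of the analysis pipeline of this study; the function names and the mutation rate used in the example are assumptions chosen only for illustration. The first function evaluates the IAM expression exp(-2μt) quoted above, and the second applies one generation of the SMM to a repeat count.

```python
import math
import random


def iam_match_probability(mu: float, t: float) -> float:
    """IAM: probability that two lineages still match at one locus
    after t generations, i.e. exp(-2 * mu * t)."""
    if mu < 0 or t < 0:
        raise ValueError("mu and t must be non-negative")
    return math.exp(-2.0 * mu * t)


def smm_step(repeats: int, mu: float, rng: random.Random) -> int:
    """SMM: with probability mu the allele gains or loses exactly one
    repeat unit in a generation; otherwise it is unchanged."""
    if rng.random() < mu:
        repeats += rng.choice((-1, 1))
    return max(repeats, 1)  # a repeat count below 1 is not meaningful


# Illustrative (assumed) rate of 2e-3 per locus per generation.
example_mu = 2e-3
p_match_100_generations = iam_match_probability(example_mu, 100)
```

Under the SMM a later mutation can return the allele to an earlier repeat count, which is the homoplasy described above; the IAM sketch ignores that possibility by construction.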

2.4.3 Mutation rates:


In population genetic studies, the dating of Y lineages and demographic
events is based on knowledge of mutation rates. Mutations on the Y chromosome are
random and obey first-order kinetics. The genetic diversity of a locus is a function
of the mutation rate and the effective population size (Burgarella and Navascués,
2011). Absolute mutation rates have been estimated either by pedigree analysis or
from Y chromosome microsatellite variation within a Y HG.

In pedigree analysis, a direct count in deep rooted lineages yielded a mutation rate
of 2×10⁻³ per generation (Heyer et al., 1997). A study by Kayser et al. (2000) on
father/son pairs yielded a mutation rate of 3×10⁻³ per generation. A study of 18,000
DNA sequences from sperm cells showed a mutation rate of 2×10⁻³ for the YSTRs
DYS19 and DYS390 (Holtkemper et al., 2001). Pedigree based mutation rates are
expressed per meiosis.
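
A pedigree based rate is simply the number of observed mutations divided by the number of meioses examined. The sketch below shows this arithmetic together with a Wilson score interval; the counts in the example are hypothetical and are not taken from any of the studies cited above.

```python
import math


def pedigree_mutation_rate(mutations: int, meioses: int) -> float:
    """Mutation rate per locus per meiosis from direct counts."""
    if meioses <= 0 or not 0 <= mutations <= meioses:
        raise ValueError("need 0 <= mutations <= meioses and meioses > 0")
    return mutations / meioses


def wilson_interval(mutations: int, meioses: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for the rate."""
    p = pedigree_mutation_rate(mutations, meioses)
    denom = 1.0 + z * z / meioses
    centre = (p + z * z / (2.0 * meioses)) / denom
    half = z * math.sqrt(p * (1.0 - p) / meioses
                         + z * z / (4.0 * meioses ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 14 mutations seen in 5,000 father/son transmissions.
rate = pedigree_mutation_rate(14, 5000)
low, high = wilson_interval(14, 5000)
```

Because mutations are rare events, the interval is wide unless several thousand meioses are scored, which is one reason pedigree and evolutionary estimates can differ.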

By counting the number of mutations in the branches of a median network of
Native American populations, Forster et al. (2000a) identified a difference
between evolutionary and pedigree mutation rates. Pedigree based rates gave
lower age estimates. The discrepancy between the two age estimates could be due
to the use of fast mutating markers and the age of the individuals sampled in
pedigree analysis (>30 years). Mutation rates are known to increase with age, and
present day paternal ages may not reflect those of prehistoric fathers.

Zhivotovsky et al. (2004) estimated the effective mutation rate using the
YSTR data within a Y HG. The value was found to be 6.9×10⁻⁴ per 25 years. This
value was used to estimate the expansion times of the African Bantu population, the
divergence of Polynesian populations and the origin of the Gypsy population in
Bulgaria. Evolutionary mutation rates are based on current variation in
microsatellites, which has been influenced by reverse mutations to old alleles and
forward mutations to new alleles. This mutation rate has been widely used in
evolutionary studies, including the present study.
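
As an illustration of how an effective rate of this kind is turned into an age, the following sketch divides the average squared difference (ASD) between each chromosome and a founder haplotype by the rate of 6.9×10⁻⁴ per locus per 25 years. It is a simplified reading of the approach, not the procedure used in this study: the founder haplotype is approximated here by the per-locus median, and the haplotypes are invented.

```python
from statistics import median

EFFECTIVE_RATE = 6.9e-4  # per locus per 25 years (Zhivotovsky et al., 2004)
YEARS_PER_UNIT = 25


def asd_to_founder(haplotypes):
    """Average squared difference in repeat number between each
    haplotype and the founder (assumed: per-locus median allele)."""
    n_loci = len(haplotypes[0])
    if any(len(h) != n_loci for h in haplotypes):
        raise ValueError("all haplotypes must have the same loci")
    founder = [median(h[i] for h in haplotypes) for i in range(n_loci)]
    total = sum((allele - f) ** 2
                for h in haplotypes
                for allele, f in zip(h, founder))
    return total / (len(haplotypes) * n_loci)


def coalescence_age_years(haplotypes):
    """Age estimate in years: ASD / effective rate * 25."""
    return asd_to_founder(haplotypes) / EFFECTIVE_RATE * YEARS_PER_UNIT


# Five hypothetical three-locus haplotypes within one haplogroup.
sample = [(14, 12, 22), (14, 12, 23), (15, 12, 22), (14, 12, 22), (14, 13, 22)]
asd = asd_to_founder(sample)             # 3 squared steps / 15 = 0.2
age = coalescence_age_years(sample)      # about 7,246 years
```

The estimate scales linearly with the ASD, so a small number of divergent chromosomes in a small sample can shift the age considerably.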

2.5 Nomenclature system for YSNPs and YSTRs

With the increase in the number of binary markers discovered, different
systems came to be used to name these SNPs. There were seven different
nomenclatures (Jobling and Tyler-Smith, 2000; Hammer et al., 2001; Underhill et al.,
2000; Karafet et al., 2001; Semino et al., 2000; Su et al., 1999; Capelli et al., 2001).
Therefore, to develop a uniform method of naming these SNPs, the Y Chromosome
Consortium (YCC) laid down a set of rules in 2002 (Consortium, 2002). The
combination of binary polymorphisms yielded a phylogenetic tree based on the
maximum parsimony method. Capital letters A-R were used to identify 18 major
clades. The letter is followed by the name of the terminal mutation that defines the
haplogroup (HG). Lineages that were not defined by a derived mutation were placed at
the interior nodes of the tree. These are referred to as paragroups and are indicated
by the symbol * (Karafet et al., 2008). Subsequently, Jobling and Tyler-Smith (2003)
revised the YCC 2002 tree to include all the markers that were discovered after
2002. Later, Karafet et al. (2008) revised this nomenclature with clades from A-T.
This is the most commonly used tree. The International Society of Genetic Genealogy
(ISOGG), a non-commercial organisation formed in 2005, frequently updates the Y
chromosome phylogenetic tree. The phylogenetic tree is shown in Fig. 3.

Figure 3: Phylogenetic tree of Y chromosomal haplogroups

The Human Genome Organisation nomenclature (HUGO, 2012) laid down the
standardized YSTR nomenclature regulations, which are listed below:

1. All YSTRs are represented by DYS following the name of the STR.

2. Alleles should be named based on the number of variant and non-variant
repeats from sequence data. Single repeat units located next to the main repeat
motif with the same sequence should be considered as part of the repeat.

3. Repeat units that are not adjacent to the main repeat motif and have fewer than
three units with no size variation (in humans or chimpanzee) should not be
considered for nomenclature.

4. Intermediate alleles (e.g. 11.1) should be represented by the number of
complete repeat units and the number of bp of the partial sequence, separated
by a decimal point.

5. Intermediate alleles formed by mutations in the flanking region that can alter
the allele length should be represented by the number of full repeat units followed
by the direction and position of the mutation relative to the STR.

6. Point mutations that affect PCR annealing should be verified by sequencing,
and designations must be used to represent them as per the guidelines.

7. New sequence variation should adapt to the locus delimiting criteria.

8. Journal editors, reviewers and organisers should use the standardised
nomenclature to assure uniformity of nomenclature usage.

9. Commercial Yfiler kits should also follow the standardised nomenclature.
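
The rule for intermediate alleles (complete repeat units, a decimal point, then the number of bp in the partial repeat) can be written as a short Python sketch. The helper names are hypothetical and only the designation format described in the list above is assumed.

```python
from typing import Tuple


def allele_designation(complete_repeats: int, partial_bp: int = 0) -> str:
    """Format an allele as '<complete repeats>' or, for an intermediate
    allele, '<complete repeats>.<bp of the partial repeat>'."""
    if complete_repeats < 0 or partial_bp < 0:
        raise ValueError("counts must be non-negative")
    if partial_bp == 0:
        return str(complete_repeats)
    return f"{complete_repeats}.{partial_bp}"


def parse_designation(label: str) -> Tuple[int, int]:
    """Split a designation back into (complete repeats, partial bp)."""
    whole, _, partial = label.partition(".")
    return int(whole), (int(partial) if partial else 0)


intermediate = allele_designation(11, 1)   # "11.1"
regular = allele_designation(14)           # "14"
```

Keeping the two parts as separate integers, rather than as a floating point number, avoids treating 11.1 as a quantity on which arithmetic such as repeat differences can be done directly.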


2.6 Evolution of NRYHG markers and its global distribution: 


The global distribution and proposed routes of migration of the various Y
haplogroups prevalent in India are shown in Fig. 4a-g. Each HG is seen at high
frequency in a given geographical region or population. The unidirectional
evolution of NRY HGs has thus made it possible to infer the human occupation of
various parts of the world. Summarized below is the existing knowledge on this
aspect. Tables 1a-1g show the percentage frequencies of various haplogroups for
13,779 Y chromosomes from 131 geographical areas.
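
Percentage frequencies of this kind are obtained by counting the chromosomes assigned to each haplogroup in a population and dividing by the sample size. The sketch below shows the calculation on a hypothetical list of assignments; it does not reproduce any value from the tables.

```python
from collections import Counter


def haplogroup_frequencies(assignments):
    """Return {haplogroup: percentage frequency} for one population."""
    n = len(assignments)
    if n == 0:
        raise ValueError("no chromosomes supplied")
    counts = Counter(assignments)
    return {hg: 100.0 * c / n for hg, c in counts.items()}


# Hypothetical population of eight Y chromosomes.
population = ["H", "H", "L", "R", "H", "F", "L", "H"]
freqs = haplogroup_frequencies(population)
# H: 50.0, L: 25.0, R: 12.5, F: 12.5
```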

2.6.1 HGs with African Origin: 

NRY HG A: This haplogroup, defined by the M91 mutation, is completely restricted
to the African continent (Hammer et al., 2001; Underhill et al., 2001) and is found
most frequently in Khoisan populations. The coalescence age of the NRY root was
estimated to be 142 Kya (Cruciani et al., 2011), suggesting that the earliest common
ancestor of all humans originated in Africa. This mutation is not found anywhere
else in the world.

NRY HG B: This haplogroup is also restricted to Sub-Saharan Africa, in
populations such as Pygmies and Baka (Berniell-Lee et al., 2009). The age of this
marker is ~50,000-60,000 Ybp. It is the second oldest marker after
haplogroup A.

2.6.2 Phylogeography in south Asia 

NRY HG D: This haplogroup is mostly present in Northern and Eastern Asia; it is
frequent in Tibet and Japan and is present at lower frequencies among Southeast
Asians and the Andamanese (Shi et al., 2008). The age of expansion of this HG is
about 60,000 Ybp. The relic distribution of this HG in East Asia is attributed to the
spread of Han culture and the Last Glacial Maximum.

NRY HG C: Fig 4a shows the distribution of NRY HG C in global populations and
its migration pattern. This clade is considered to mark the first migrants into India
(Wells et al., 2001). Paragroup C*-M28 is most frequent in East Asia but is also
distributed in other regions at low frequencies. YHG C1-M8 is specific to the
Japanese. Y haplogroup C3-M217 and its subtypes are extensively distributed in East
Asia, Central Asia and Siberia. This clade is considered to have a Mongolian origin
(Zhong et al., 2010). The subtypes of NRY HG C3 show a north to south cline. NRY HG
C5-M356 specific lineages are found in India and had an in situ origin in India
(Sengupta et al., 2006). The YSTR diversity is highest in Austronesian populations.
The proposed route of migration of M130 into Southeast Asia and Australia was via
the Indian coast during the Palaeolithic, ~50 Kya. The age estimates (Fig. 4a)
further support this migration route. These geographically specific haplogroups
have undergone long term isolation (Zhong et al., 2010).

Table-1a: Percentage frequencies of NRY HG C and its subclades with increasing
longitude (West to East)

Table-1b: Percentage frequencies of NRY HG F and its subclades with increasing
longitude (West to East)

Figure 4a: Global frequency distribution map of NRY HG C and its sub clades

Note: Each NRY HG sub clade is represented by a different colour. The size of the pie
is proportional to the frequency of the haplogroup. The colour key for each
haplogroup is shown in the box provided in the figure, along with the age of the
haplogroup obtained from the various literature sources listed in Tables 1a-1g.

NRY HG F: Fig 4b shows the frequency distribution of F lineages, which are mainly
present in East Asia. Paragroup F* is observed mainly in India. F*-M89 is known to
have higher STR variance in Tamil Nadu and Andhra Pradesh, near the eastern coast
of India (Sengupta et al., 2006). The Dravidian speaking Nilgiri Hill Tribe
Foragers (HTF) of Tamil Nadu showed long term STR evolution
(ArunKumar et al., 2012). The TMRCA of F*-M89 is estimated to be 29,344 Ybp in
this study. In contrast, the F* populations of Orissa seemed to have limited STR
evolution (ArunKumar, 2012). Northeast populations showed a very low frequency of
this HG. All these provide evidence for the autochthonous origin and evolution of
F*-M89 in Southern India. F2-M427 and M428, in contrast, are restricted to the Lahu,
Sino-Tibetan language speakers of China (Sengupta et al., 2006).

NRY HG H: The distribution of HG H and its subclades is mainly restricted to
India (Fig 4c). The spatial frequency maps of HG H1 from the study of Sengupta et al.
(2006) suggest high STR variance towards the Maharashtra region in coastal
Western India and high Y HG frequency in Eastern India. The study also revealed that
Y HG H2-Apt shows high STR variance towards Eastern coastal India and could have
had an in situ origin in India. A study by Trivedi et al. (2008) revealed that the
highest gradient is towards west India (44.4%). In Orissa populations, H-M69 and
H1a-M82 were present mainly in Dravidian speaking populations (ArunKumar, 2012).
In Tamil Nadu populations, the Nilgiri Hill tribes speaking a Kannada dialect (HTK)
showed a higher frequency (42.5%) and age (42.52 Kya) (ArunKumar et al., 2012). The
YSTR based coalescence time obtained in this study was ~43,556 years. Further study
is essential to decipher the centre of origin of this haplogroup within India.

Table-1c: Percentage frequencies of NRY HG H and its subclades with increasing
longitude (West to East)

Table-1d: Percentage frequencies of NRY HG L and its subclades with increasing
longitude (West to East)

Figure 4c: Global frequency distribution map of NRY HG H and its sub clades

Note: Each NRY HG sub clade is represented by a different colour. The size of the pie
is proportional to the frequency of the haplogroup. The colour key for each
haplogroup is shown in the box provided in the figure, along with the age of the
haplogroup obtained from the various literature sources listed in Tables 1a-1g.

HG H-M69 is also found in populations of Afghanistan such as Pashtuns and
Tajiks. Its presence in these populations is suggested to be due to gene flow from
India to Afghanistan (especially of H-M69, L-M20 and R2-M124) during the Indus
civilization or the Bactria-Margiana archaeological complex (Haber et al., 2012).
Roma gypsies in Western Europe have their founding lineages in India and are the
main source of Y HG H in Europe. Table 1c shows the haplogroup frequencies in
various geographical areas.

NRY HG L: Y HG L is found mainly in India and Pakistan (Fig 4d). However, it is
present at lower frequencies in the Middle East, Central Asia, Northern Africa and
along the Mediterranean coast. The sub clades of L, namely L1-M27/76, L2-M317 and
L3-M357, have distinct geographies. NRY HG L1-M27/76 and L3-M357 are present in
Indian and Pakistani populations respectively and are nearly absent in Turkey and
surrounding areas, suggesting distinct founders in these regions (Thanseem et al.,
2006). The study by Sahoo et al. (2006) showed the absence of NRY HG L1 in east
Indian populations and associated this with geography rather than language,
whereas the study by Trivedi et al. (2008), based on YSTRs, showed that Dravidian
speakers harboured a higher proportion of NRY HG L than Indo-European speakers. By
increasing the phylogenetic resolution, Sengupta et al. (2006) revealed the early
diversification of HG L1-M76 among Dravidian speakers during the early Holocene
(~9 Kya). The STR variance of HG L1 is higher in south India than in west India. HG
L1 is nearly absent in northeast India and Orissa (ArunKumar, 2012). Further study
of HG L1 in Tamil Nadu populations showed its affinity with dry land farmers
(ArunKumar et al., 2012). All these provide clues for an in situ origin of HG L1
among Dravidian speaking populations that practised farming in south India.

Considering the movement of pastoral groups via Turkey, the Hindu Kush,
Afghanistan and north India, HG L3 would be expected in south India. On the
contrary, its frequency there is only ~0.8%. However, comparison of six YSTR loci
in four Chenchu individuals with those of Lambadis, Punjabis and Iranians showed
considerable sharing of one haplotype (14-12-22-10-14-11). This haplotype differs
by three steps from the modal haplotype of Armenian M20 chromosomes
(15-12-23-10-13-11) (Weale et al., 2001). L2a is generally regarded as
“Mediterranean”, as it is present mainly in Turkey. This haplogroup is also found at
lower frequencies in the Parsi and Oraon populations of India (Genographic data).
NRY HG L3-M357 is mainly localised in Pakistani populations. Interestingly, a new
subclade, L3a-PK3, identified in the Kalash population of Pakistan (23%), clustered
with the Yadava population of Tamil Nadu with a TMRCA of 1400-8100 Ybp
(Mohyuddin et al., 2006a). Thus a global study of the HG L clade would provide more
insight into the existing controversy over the migration pattern of this clade.
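
The three step difference between the two six-locus haplotypes quoted above can be checked by summing the absolute differences in repeat number locus by locus, as in this short sketch. The function name is illustrative, and the locus order is taken to be the same in both haplotypes.

```python
def step_distance(hap_a, hap_b):
    """Number of single-repeat steps separating two YSTR haplotypes
    (sum of absolute repeat differences across matched loci)."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be typed at the same loci")
    return sum(abs(a - b) for a, b in zip(hap_a, hap_b))


shared_haplotype = (14, 12, 22, 10, 14, 11)   # Chenchu, Lambadi, Punjabi, Iranian
armenian_modal = (15, 12, 23, 10, 13, 11)     # Armenian M20 modal haplotype

distance = step_distance(shared_haplotype, armenian_modal)  # 3
```

The haplotypes differ by one repeat at the first, third and fifth loci, giving the three steps stated in the text.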

NRY HG O: The spread of NRY HG O2a-M95 has been associated with the spread
of Austro Asiatic (AA) populations (Fig 4e). Kumar et al. (2007) suggested that the
Mundari populations were the source of O2a-M95 around 65,000 Ybp. This view is
supported by the absence of this clade in other parts of India. The HG frequency
and YSTR variance were found to be higher in Mundari speakers than in Southeast
Asian populations, whereas the HG O3 lineages were concentrated among
Tibeto-Burman (TB) speakers. The study by Trivedi et al. (2008) concluded that the
NRY HG O lineages in India had a Southeast Asian origin. These lineages arrived at
different times, as no HG O3e lineages were found in AA speakers. Another study,
from the Genographic India-China study, gave evidence for the origin of AA
speakers in Laos with a TMRCA of 64.2 Kya. This migration was mainly male mediated.

Table-1f: Percentage frequencies of NRY HG R and its subclades with increasing
longitude (West to East)

Figure 4e: Global frequency distribution map of NRY HG O and its sub clades

Figure 4f: Global frequency distribution map of NRY HG R and its sub clades

Note: Each NRY HG sub clade is represented by a different colour. The size of the pie
is proportional to the frequency of the haplogroup. The colour key for each
haplogroup is shown in the box provided in the figure, along with the age of the
haplogroup obtained from the various literature sources listed in Tables 1a-1g.


NRY HG R: NRY HG R and its sub clades are geographically widespread. HG R-M173
is considered an ancient marker that first arose in Homo sapiens sapiens in
Eurasia (Al-Zahery et al., 2003). Northern Cameroon is one of the African
populations that carries the R-M173 mutation, at a frequency of ~40%. The origin and
spread of the sub clades, especially NRY HG R1a1a-M17, gave rise to different
schools of thought. Wells et al. (2001) suggested that M17 lineages and their YSTR
diversity are highest in Central Asia (South Russia/Ukraine), which could be the
probable origin of this marker. Another study, by Sharma et al. (2009), showed that
R1a1a-M17 in Indian populations was much older than in Central Asian populations,
thus supporting an Indian origin of R1a1a-M17. Their study suggests that Brahmin
populations, which carry R1a1a-M17 at high frequency irrespective of their
linguistic and geographical affiliation, could be the founders of this haplogroup.
This further supports the formation of the caste system in India. The study from
Genographic India showed that NRY HG R1a1a-M17 was significantly high in
Indo-European populations, irrespective of their geography. It suggested that the
ancestors of HG R1a1a-M17 could have come from India, on account of the low
effective population size, high YSTR variance and high mean pairwise difference
compared to other global populations. NRY HG R2-M124 is present at higher
frequency in India, and deeper age estimates suggest an Indian origin during the
late Pleistocene (Cordaux, 2004; Trivedi et al., 2008).

2.6.3. Major Y haplogroups of Middle East and Caucasus:


NRY HG G: This clade is mostly present in the Middle East, the Mediterranean and
the Caucasus region. Semino et al. (2000) suggest that haplogroups J and G had a
common ancestry. Populations speaking northwest Caucasian languages show a high
frequency of NRY HG G-M201 (Nasidze et al., 2003). The lineages of G are known to
correlate with the archaeological areas of the Bronze Age Hattic and Kaska cultures
(Cinnioglu et al., 2004a). Rootsi et al. (2012) studied 16 informative G clades in
populations from Europe to Pakistan and associated NRY HG G and NRY HG J2 with the
spread of agriculture in Europe. The study also proposes that the homeland of this
NRY HG could be Anatolia, Armenia or Western Iran.

Y HG J: Quintana-Murci et al. (2001) suggested that NRY 12f2a lineages spread to
India during the Neolithic period with farming technology, thereby indicating the
entry of Indo-Aryan migrants into India through the Western corridor. Their study
showed that the microsatellite variation of J-M172 is higher (0.947) than that of
J (xM172) (0.844); hence J-M172 is an older marker than J (xM172). Their study
hypothesises that J-M172 could have expanded in the northwest of the Fertile
Crescent and spread along with agriculture, whereas J (xM172) must have had its
centre on the eastern side of the Fertile Crescent and expanded into Arab
populations. J1-M267 is present at a frequency of 9% in Turkey (Cinnioglu et al.,
2004a), with a short DYS388 allele of 13 repeat units. The study by Sengupta et al.
(2006) suggests the eastward expansion of NRY HG J2a-M410 with agriculture and
painted pottery into the Indus valley during the Neolithic period. The YSTR based
age estimates of Y HG J2a-M410 and J2b2-M241 exceed the age of agriculture in
India, i.e., 6 Kya.

Table-1g: Percentage frequencies of NRY HG J and its subclades with increasing
longitude (West to East)

Figure 4g: Global frequency distribution map of NRY HG J2a and its sub clades

Note: Each NRY HG sub clade is represented by a different colour. The size of the pie
is proportional to the frequency of the haplogroup. The colour key for each
haplogroup is shown in the box provided in the figure, along with the age of the
haplogroup obtained from the various literature sources listed in Tables 1a-1g.


2.7 Previous studies from this laboratory:


This laboratory has been actively involved in the study of Human Leucocyte
Antigen (HLA) polymorphism for the past thirty years. The study designs are based
on the complex inbreeding units that exist in India. HLA DRB1* and DQB1* alleles
have been found to be specific to the Piramalai Kallar and Yadava populations of
Madurai respectively (Shanmugalakshmi et al., 2003). This study also suggests that
endogamous units, sympatrically isolated castes or well defined breeding isolates
that live under the same milieu and epidemiology may be ideal models to test the
immunogenetic basis of disease. Pitchappan (1998) brought out the differences in
the distribution of HLA haplotypes between Indian and Caucasian populations.

The leprosy-affected sib-pair studies by whole genome microsatellite mapping
identified a susceptibility locus at 10p13 (Siddiqui et al., 2001; Tosh et al., 2002).
This locus has been mapped to the disease in C20 families from Tamil Nadu, but was
absent in the neighbouring state of Andhra Pradesh, thus reflecting the importance
of community genetics in the genomic era. All these studies indicate that, for any
case control study, the controls have to be matched for age, sex and caste for
appropriate comparison (Pitchappan, 2002). The study on Eurasian populations gave
evidence for the first coastal human migration from Africa to Australia via the
Indian subcontinent (Wells et al., 2001).

NRY studies on 31 populations of Tamil Nadu suggested that both caste and
tribal populations had overwhelming frequencies of H-M69, F-M89, R1a1a-M17,
L1-M27, R2-M124 and C-M130. These lineages date back to the late Pleistocene in
these populations. The West Eurasian contribution to Y lineages has been <20%. A
strong genetic structure was found to be associated with the mode of subsistence
of the study populations. Social stratification was found to have been established
by 4-6 Kya, predating the establishment of the Varna system.

Another study on the populations of Orissa and North east India showed a
strong correlation between NRY and language. NRY HGs such as F*, H and H1a were
predominant in Dravidian speaking populations of Orissa, whereas the Austro
Asiatic speakers possessing HG O2a-M95 migrated through the northeast corridor
from Laos. Laos could have been the probable geographical origin of HG O2a
(~64.2 Kya) (ArunKumar, 2012). These migrations were male mediated, and no mtDNA
genetic resemblances were found in India. These migrations may have been coupled
with the practice of shifting cultivation. The NRY (HG R1a1a) and mtDNA (M*, M6
and U*) composition of Indo European speakers suggests their autochthonous origin
in India.


3. MATERIALS AND METHODS 


3.1 Sampling: 


A total of 2,522 healthy male volunteers belonging to 33 castes and 18
tribal populations from Andhra Pradesh (N=774) (latitude: 17.047762, longitude:
80.098187), Karnataka (N=877) (latitude: 15.317277, longitude: 75.713888),
Maharashtra (N=458) (latitude: 19.751480, longitude: 75.713888) and Gujarat
(N=413) (latitude: 22.258652, longitude: 71.192380) were enrolled and sampled
either in their households or in public places. In addition, 170 samples of
Nattukottai Chettiars from Chettinad in Tamil Nadu (latitude: 11.127123,
longitude: 78.656894), India, were collected at two of their community
congregations. The ethnographic details of these populations are given in
Appendix 1. The volunteers were all above the age of 18, and written informed
consent (Appendix 2) was obtained, witnessed by a local interpreter or community
leader. The choice of populations to be sampled was based on the advice of
anthropologists and geneticists. The study populations were selected based on
their uniqueness, antiquity and population size. The list of all the advisors and
collaborators who assisted the Genographic team, and the work load shared by the
members of the laboratory, are given in Appendix 3. The sampled locations (as
co-ordinates), caste/tribe names and numbers collected (N) are shown in Fig 5.

The current study is a part of The Genographic Project-India. Ethical clearance
for the study protocols was obtained from Madurai Kamaraj University, Madurai.
Necessary permissions from local government bodies, village heads and educational
institutions were obtained before sampling. The volunteers were approached
through a local contact. The purpose and methodology of sampling were explained to
the volunteers and to the head of the institution or village in their local
dialect. On their approval, questionnaires were filled in by the volunteers
(Appendix 4, 5). For the samples collected at each location, a sampling team
document and a village document were filled in (Appendix 6, 7). The demographic
details of the studied populations are given in Table 2.


3.2 Topography of Sampled locations 


3.2.1 Gujarat 


Gujarat is the north westernmost state of India, on the Arabian Sea (Fig 1).
It covers a land mass of 196,030 km². It borders Pakistan and Rajasthan to the
north, Madhya Pradesh and Maharashtra to the east, and the Arabian Sea to the south
and west. The land mass of Gujarat is divided into three regions: peninsular
Saurashtra, Kutch and the Gujarat corridor. Sir Creek (a 96 km strip of water)
demarcates Pakistan from the Kutch region of Gujarat. The Gulf of Kutch divides the
Kutch region from Saurashtra, and the Gulf of Khambhat separates the Saurashtra
region from the southern corridor of Gujarat. The river Narmada forms one of the
traditional barriers between North and South India. It drains into the Gulf of
Khambhat. Its basin covers 14% of the land in the state of Gujarat. The study
populations were sampled across the entire area of Gujarat.


3.2.2 Maharashtra 


Like Gujarat, Maharashtra was carved out as a linguistic state in 1960
(Agrawal and Agarwal, 1995). Maharashtra lies in the mid-western part of India. It
is surrounded by the Arabian Sea in the west, Gujarat and Dadra and Nagar Haveli in
the north, Madhya Pradesh in the northeast, Chhattisgarh in the east, Andhra
Pradesh in the southeast, Karnataka in the south and Goa in the southwest. The
state covers 307,731 km² in area and contains two reliefs: the Deccan tableland and
the Konkan coastal strip. The Sahyadri hills, which separate the two reliefs, are
the backbone of Maharashtra. Tribal populations such as Katkari, Warli and Kokni,
and caste populations such as Deshastha Brahmin, Chitpavan Brahmin, Dhangar,
Maratha and Parsee, are the major populations that inhabit this region. Korku and
Kolam tribes are found in the Gavilgad ranges of the Satpura hills. Gonds are found
in the Gondwana region, a hill tract that extends from the Vidarbha region of
Maharashtra to the west of Chhattisgarh through the north of Madhya Pradesh. The
present study includes all the populations that inhabit these regions.

Table 2: Demographic table for various study populations


3.2.3 Karnataka 


Karnataka is a south western state of India. It is bordered by the Arabian Sea to
the west, Maharashtra to the north, Andhra Pradesh to the east, Tamil Nadu to the
southeast and Kerala to the southwest. The area covered by the state is 191,976 km².
It has been the homeland of Kannadigas, Kodavas, Tuluvas and Konkani speakers.
Geographically, it has three principal regions: coastal Karavali (Dakshina Kannada
and Udupi districts), hilly Malenadu covering the Eastern and Western Sahyadri
ranges (Uttara Kannada, Shimoga, Chikkamangaluru, Kodagu and Hassan districts) and
the plains of the Deccan plateau called Bayaluseeme (North Bayaluseeme includes
Belgaum, Gulbarga, Bidar, Dharwad, Chitradurga and Raichur districts; South
Bayaluseeme includes Bangalore, Mysore, Kolar and Mandya districts). The sampled
populations from Karnataka included all the above mentioned language speakers from
the Karavali, Malenadu and South Bayaluseeme regions.


3.2.4 Andhra Pradesh 


Geographically, Andhra Pradesh lies on the southeast coast of India. It is
bordered by Maharashtra, Chhattisgarh and Orissa to the north, Tamil Nadu to the
south and Karnataka to the west. To the east is the Bay of Bengal. It occupies an
area of 2,172,000 km². It has three regions: the northern plateau region called
Telangana, the southern part called Rayalseema, and Coastal Andhra. Telangana and
Rayalseema are divided by the river Krishna. Coastal Andhra includes the districts
between the Eastern Ghats and the Bay of Bengal, from the border with Orissa in the
north to the south of the Krishna delta. These districts include Srikakulam,
Vizianagaram, Vishakapatnam, East Godavari, West Godavari, Krishna, Guntur,
Prakasam and Nellore. This study mainly focused on the populations from Coastal
Andhra Pradesh and the Godavari districts.


3.3 Sample collection and DNA extraction: 

Mouthwash samples were collected using 30 ml of plain commercial bottled water (Aqua).
This method is user-friendly and non-invasive, and a large number of samples could be
collected in a reasonably short time. In short, 30 ml of the water was given to each
volunteer in a plastic cup. The cup, questionnaire and informed consent form of
each volunteer were given a unique identifier. The volunteer was asked to swish the
water in his mouth for one minute and spit the contents into the plastic cup. 50 µl of
30% sodium azide (P/N 0191/3391/06013, S.D. Fine Chem Ltd) was added to the
collected mouthwash as a preservative, to prevent any further growth of microflora.
The sample was left to stand for some time for the food particles to settle and then
decanted into a 50 ml tube (Cat No. 227261, Greiner Bio-One). The samples were
transported and the initial step of cell isolation was performed in the makeshift camps.

The samples were centrifuged at 2500 rpm for 10 min to settle the buccal
cells. The supernatant was discarded and 1 ml of White Cell Lysis Buffer
(WCLB) was added to the pellet (Appendix 8). This was transferred to a 1.5 ml
microcentrifuge tube (P/N: 616201, Greiner Bio-One) and couriered to the parent
laboratory at Madurai Kamaraj University, Madurai. The samples were couriered to the
laboratory every three or four days to avoid any damage caused by long-term storage.

In the laboratory, further steps of DNA extraction were carried out by other
research fellows and technicians. The salting-out method (Ausubel, 2002) was employed
to extract DNA. The samples received were transferred to a 15 ml tube (P/N:
188271, Greiner Bio-One) and 1 ml more of lysis buffer was added to make up the
volume to 2 ml. The samples were incubated at 42°C for two hours in a slanting
position. After the incubation, 1 ml of 6 M NaCl (SRL 828947, recrystallized) was
added to the samples, which were vortexed for 10 seconds and placed on crushed ice
for 10 minutes. The samples were then centrifuged at 4000 rpm for 10 minutes. The
supernatant was transferred to a fresh 15 ml tube and an equal volume of
ice-cold 100% ethanol was added. The samples were mixed gently by rolling and
inversion. Precipitated DNA was visible at the interface in most of the samples, but in
some cases the precipitate was not visible to the naked eye. The precipitate was
centrifuged at 4000 rpm for 10 minutes and the supernatant was discarded, leaving the
pellet in the tube. 1 ml of 70% ethanol was added to the pellet, which was then
transferred to a 1.5 ml microcentrifuge tube. The samples were then centrifuged at
8000 rpm for 3 minutes. The 70% ethanol wash was repeated twice and finally the DNA
obtained was air-dried, suspended in 150 µl of Tris-EDTA buffer (Appendix 8)
and incubated at 42°C overnight. The DNA was stored at -20°C.


3.4 GENOTYPING: 


3.4.1 DNA dilutions: 

As a first step in the preparation for genotyping, DNA was diluted 10 times
(first dilution stock of 100 µl) in 10 mM Tris, 0.1 mM EDTA buffer, in 96-well
flat-bottom dilution trays (P/N 655201, Greiner Bio-One). From this, 2 µl was used for
quantification of DNA by the Quantifiler assay. The data obtained from this assay were
used to prepare subsequent assay-specific templates for dotting (YSNP assay) and
Multiplex assays. All the PCRs were performed as per the manufacturer’s
recommendations.
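
The preparation of assay-specific templates from a quantified stock follows the usual
C1V1 = C2V2 relation. The following is a minimal sketch only: the function name, the
50 ng/µl stock and the 100 µl final volume are hypothetical, while the 10 ng/µl target
is the concentration stated for YSNP dotting later in this chapter.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul, final_volume_ul):
    """Return (stock_ul, buffer_ul) needed to reach the target concentration.

    Uses C1 * V1 = C2 * V2, where V2 is the final volume.
    """
    if target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("stock is more dilute than the requested target")
    stock_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_ul, final_volume_ul - stock_ul


# Hypothetical example: a stock quantified at 50 ng/µl, diluted to 10 ng/µl
# in a final volume of 100 µl.
stock_ul, buffer_ul = dilution_volumes(50.0, 10.0, 100.0)
print(f"{stock_ul:.1f} µl stock + {buffer_ul:.1f} µl buffer")
```
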
3.4.2 DNA Quantification: 

The DNA thus obtained was quantified using the Quantifiler kit (P/N: 4343895,
Applied Biosystems, ABI) in an ABI 7900HT Fast Real-Time PCR system (S/N
279000947, ABI). This assay targets the human telomerase reverse transcriptase gene,
and hence bacterial and other non-human DNA was not measured. The real-time PCR
employed TaqMan chemistry: the probe specific to this gene was labelled with FAM
dye, whereas the IPC (Internal PCR Control) was labelled with VIC dye. The IPC is a
synthetic sequence that is present in the Quantifiler PCR mix. It is amplified with
each sample during PCR, which helps in detecting PCR failures and inhibitors.

For the Quantifiler assay, 25 µl of PCR mix was added to each well of a 384-well
optical plate (P/N: 4309849, ABI), 2 µl of the respective DNA was added to it and the
plate was sealed with an optical sealer (P/N: 4311971, ABI). Eight standards were
tested along with the test samples. The assay was carried out as per the manufacturer’s
protocol. The results of the PCR were analysed using Sequence Detection System (SDS)
v2.3 software, ABI. The amount of human DNA present was measured by the Absolute
Quantification method (Fig 6).
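
Absolute quantification of this kind rests on a standard curve: the threshold cycle
(Ct) of each standard is regressed on the logarithm of its known quantity, and the
fitted line is then inverted for the test samples. The sketch below illustrates that
idea only; it is not the SDS software's implementation. The function names, the
threefold dilution series and the Ct values (generated from an assumed slope of -3.3
and intercept of 28) are synthetic and are not data from this study.

```python
import math


def fit_standard_curve(quantities_ng_per_ul, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities_ng_per_ul]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the quantity of a test sample."""
    return 10 ** ((ct - intercept) / slope)


# Synthetic eight-point standard series (assumed values, for illustration).
standards = [50.0 / 3 ** i for i in range(8)]
cts = [-3.3 * math.log10(q) + 28.0 for q in standards]

slope, intercept = fit_standard_curve(standards, cts)
unknown = quantity_from_ct(26.0, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, sample={unknown:.2f} ng/µl")
```

Samples with more template cross the threshold earlier, so the slope is negative, in
line with the note to Fig 6.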


3.4.3 NRY - SNP Genotyping: 


The samples were genotyped for Y-SNPs using TaqMan chemistry, also referred to
as the 5’ nuclease assay. Here, the biallelic states of YSNPs were detected in a
real-time PCR assay, using probes specific for each allelic state (ancestral and
derived). The probes specific to the ancestral state were tagged with VIC dye (green),
whereas the probes specific to the derived state were tagged with FAM dye (blue) in
most cases. Each probe carried its reporter dye together with a quencher molecule.
When there is no hybridization between the probe and the template, no fluorescence is
emitted, owing to the proximity of the quencher molecule to the reporter dye
(Fluorescence Resonance Energy Transfer). Upon hybridization of the probe with the
template DNA, the 5’ nuclease activity of the Taq polymerase cleaves the probe and
separates the reporter dye from the quencher. The fluorescence thus emitted is
captured to detect the derived or ancestral state of the YSNP in the same well.

Figure 6: Absolute Quantification: PCR amplification plot
Note: The X axis shows the PCR cycle number and the Y axis shows the fluorescence.
Samples with a higher quantity of DNA show amplification in the early cycles of PCR.

All the 2,522 samples were studied for a total of 52 YSNPs, using custom-made
probes obtained from Applied Biosystems, Foster City, USA, specifically for The
Genographic Project. Firstly, 1 µl of 10 ng/µl DNA was dotted onto 384-well optical
trays. 96 samples were dotted in each of the four quadrants of the 384-well plate, and
hence four different SNP probes could be tested in a single PCR run. The DNA-dotted
plates were allowed to air-dry before the PCR setup. The PCR was set up as per the
manufacturer’s protocol (“Allelic discrimination.assay.ABI.online protocol,” 2012).
To the pre-dotted trays, TaqMan genotyping master mix (P/N: 4326614, ABI) and
custom-made probe/primer mix were added to make up the volume to 5 µl. The
fluorescence was measured before and after the PCR. The specific alleles detected
were visualised in an allelic discrimination plot in Sequence Detection System (SDS)
v2.0 software (Fig 7). The cycling conditions were as follows:
Stage   Temperature   Time           Cycles
I       95°C          10 min         1
II      95°C          15 sec         50
        60°C          1 min 30 sec


The SDS software output was fed into Autocaller software v2.3 to assign the
ancestral and derived states. The haplogroup assignment programme developed by the
IBM group of The Genographic Project assigned the haplogroups based on the results
of Autocaller, using the 2008 Y-chromosomal phylogenetic tree (Karafet et al., 2008).
These assignments were also verified manually using the HG assignment pattern based
on the YSNP hierarchy (Appendix 9).

Figure 7: Allelic discrimination plot of NRY SNP genotyping
Note: Blue dots (FAM) represent the presence of the NRY HG (M304, here) and red dots
(VIC) indicate the absence of this SNP. The No Template Control (NTC) and Female
Control (FC) show low fluorescence.
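
Haplogroup assignment from a YSNP hierarchy can be pictured as a walk down the marker
tree: a sample belongs to the deepest haplogroup whose defining marker, and every
marker above it, is in the derived state. The sketch below is illustrative and is not
the Genographic assignment programme. The toy tree uses a few markers named in this
thesis but omits intermediate nodes, so it is not the full Karafet et al. (2008) tree,
and the function names are hypothetical.

```python
# Toy hierarchy: marker -> (parent marker, haplogroup label).
# Simplified for illustration; intermediate nodes of the real tree are omitted.
TREE = {
    "M168": (None, "CR"),
    "M69": ("M168", "H"),
    "M52": ("M69", "H1"),
    "M82": ("M52", "H1a"),
    "Apt": ("M69", "H2"),
}


def depth(marker):
    """Number of steps from the marker up to the root of the toy tree."""
    steps = 0
    while TREE[marker][0] is not None:
        marker = TREE[marker][0]
        steps += 1
    return steps


def assign_haplogroup(calls):
    """Return the deepest haplogroup supported by a consistent derived path.

    `calls` maps marker name -> "derived" or "ancestral". A marker counts only
    if it and all of its ancestors are derived. Returns None if none qualifies.
    """
    best = None
    for marker in TREE:
        node = marker
        consistent = True
        while node is not None:
            if calls.get(node) != "derived":
                consistent = False
                break
            node = TREE[node][0]
        if consistent and (best is None or depth(marker) > depth(best)):
            best = marker
    return None if best is None else f"{TREE[best][1]}-{best}"


sample_calls = {
    "M168": "derived",
    "M69": "derived",
    "M52": "derived",
    "M82": "ancestral",
    "Apt": "ancestral",
}
print(assign_haplogroup(sample_calls))
```

A sample that is derived for two sister markers at the same depth would be
phylogenetically inconsistent; this sketch simply keeps the first one found, whereas a
real pipeline would flag such a sample for manual review.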


3.4.4 YSTR Genotyping: 


A set of 17 YSTRs (Appendix 10) was genotyped using the AmpFlSTR Yfiler
PCR Amplification Kit (P/N: 4359513, ABI). The multiplex PCR assay was set up in a
96-well MicroAmp reaction plate (P/N N801-0560, ABI). In short, 10 µl of 0.2 ng/µl
DNA was used for PCR and amplified in a GeneAmp 9700 thermal cycler as per the
manufacturer’s protocol (“YFiler.ABI.Online.Protocol,” 2012). The cycling conditions
used for this assay are given below:

Temperature profile for YSTR and Multiplex 2 assay:

Stage   Temperature   Time      Cycles
I       95°C          11 min    1
II      94°C          1 min     30
        61°C          1 min
        72°C          1 min
III     60°C          80 min    1
        4°C           ∞


After the PCR, the samples were subjected to fragment analysis. In this
assay, 0.5 µl of the PCR product was added to 9 µl of Hi-Di formamide (P/N: 4311320,
ABI) and GeneScan LIZ-500 internal size standard (P/N: 4322682, ABI) in a fresh
96-well plate. GeneScan LIZ-500 is present in all the samples and is tagged with the
LIZ dye. Allelic ladders, the pre-amplified PCR products for the 17 loci, were also
included in the fragment analysis assay as reference standards. The samples were
electrophoresed in a 3130xl Genetic Analyzer (S/N 18233 022, ABI) with a 50 cm
capillary loaded with Performance Optimised Polymer POP-7 (P/N: 4352759, ABI).
The electrophoretic run separated the alleles based on their PCR product length.
Loci whose alleles overlap in size were labelled with different dyes, thus
allowing each locus to be scored. GeneMapper v3.1 software assigned the alleles
automatically. The ‘bins’ in this software mark the positions where an allele of a
specific size is expected (Fig 8). The data were also checked manually and ambiguous
calls were resolved by careful scrutiny or by re-runs.

A custom-made multiplex PCR was also performed for 2 YSTR loci and 6 Y-indels.
The YSTRs and indels studied are listed in Appendix 9. This assay was
called Multiplex 2. The PCR and electrophoretic conditions applied were the same as
those of the YSTR assay. The DNA concentration used for this assay was 2 ng/µl.


3.5 Quality Control: 


Quality control measures were built in at every step of
genotyping. During DNA extraction, care was taken to handle the samples in sterile
conditions, as they could be potentially pathogenic. The sterile laminar flow bench,
the work room and the benches were periodically sterilised with 70% ethanol and UV,
and by fumigation as required. Lab coats, gloves and face masks were worn as protective
measures while handling the samples and PCR setups. 0.1% sodium hypochlorite
(27908, Qualigens, Mumbai, India) was used to decontaminate the DNA extraction
reagents and waste solutions before they were discarded.

A lab technician assisted during the preparation of all the dilutions for the various
assays and the dotting of 384-well plates for the TaqMan assay (doer and checker). All
the DNA samples were handled on ice to avoid any degradation. Probe and primer
aliquots were prepared based on the requirement for each assay to avoid excess
freeze-thaw cycles.

Figure 8: YSTR assay data output as seen in Gene Mapper v3.1 software
Note: The upper lane shows the allelic ladder. The alleles (peaks) are shown within the
‘bins’ (grey). The lower lane is the sample showing its characteristic allele peaks for
each locus.


The YSNP PCR runs were validated using positive controls, negative
controls, female DNA and NTC. The positive controls were samples that
give a positive reaction for the derived allele under investigation. The negative
controls were samples carrying the ancestral allele of the YSNP under
investigation. The female DNA should show no fluorescence as there would be no
amplification. The No Template Control (NTC) samples, which included 1 µl of TE
buffer instead of the DNA, also showed low fluorescence.

For the YSTR assay, the positive and negative controls provided in the Yfiler kit
(“YFiler.ABI.Online.Protocol,” 2012) were included in every PCR setup. The alleles
assigned for the positive controls were verified against the Yfiler kit product insert.
For the Multiplex 2 assay, DNA from laboratory personnel was used as a control. These
samples were used in all the PCR setups. Allelic ladders were run along with
each plate for fragment analysis. The sample allele peaks were compared with those of
the allelic ladder to assign the allele call.

3.6 Statistical Analysis: 

The samples were analysed using various statistical tools. The multi-copy
marker loci DYS385a and DYS385b were excluded from all the analyses due to the
ambiguity in distinguishing these loci. As DYS389I is embedded in DYS389II, the
STR repeat values of DYS389I were subtracted from those of DYS389II and the
resulting value was used as DYS389b. The NRY HG frequency table was calculated in
Microsoft Excel 2007. Fisher’s exact test was performed to assess the non-random
behaviour of the observed frequencies. It was calculated in Microsoft Excel using an
add-in (Obert, 2005). Nei’s gene diversity (Nei, 1987) was estimated to determine the
NRY HG diversity. This gives the probability that two randomly chosen samples have
different YSNP haplogroups in a given population.
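
As an illustration of the two calculations described above, the following sketch
computes the DYS389b value and Nei's gene diversity. It assumes the commonly used
sample-size-corrected form h = n/(n-1) * (1 - sum of squared frequencies); whether this
correction was applied here is an assumption. The function names and the example
haplogroup counts are hypothetical and are not data from this study.

```python
from collections import Counter


def dys389b(dys389_i, dys389_ii):
    """Repeat count of the DYS389II-specific segment (DYS389II minus DYS389I)."""
    return dys389_ii - dys389_i


def nei_gene_diversity(haplogroups):
    """Nei's gene diversity with the n/(n-1) sample-size correction.

    Interpreted as the probability that two randomly chosen samples carry
    different haplogroups.
    """
    n = len(haplogroups)
    if n < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(haplogroups)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical population of ten samples.
population = ["R1a1a"] * 5 + ["H1a"] * 3 + ["J2a"] * 2
diversity = nei_gene_diversity(population)
print(f"h = {diversity:.4f}, DYS389b = {dys389b(13, 29)}")
```
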
The maps showing pie charts of the NRY HG composition in the
respective geographical areas were prepared in SAGA version 2.0.7 (SAGA
Development Team, 2008). The contour maps were constructed in 3DField
v3.5.3 software using the Kriging method (Vladimir, 2012).

To test for population differentiation, hierarchical AMOVA
was performed using Arlequin v3.5.1.3 (Excoffier et al., 2005). The fixation indices
for the three hierarchical levels, among populations within groups (Fsc), within
populations (Fst) and among groups (Fct), were computed along with their p values
from 1000 permutations. Fst genetic distances based on YSNP allele frequencies and
Rst distances based on YSTRs were also computed using Arlequin v3.5.1.3.

The evolutionary history of the populations was inferred using the Neighbour-Joining
(NJ) method (Saitou and Nei, 1987). NJ trees were computed and plotted
using the software MEGA4 (Tamura et al., 2007). Pairwise Fst and Rst distances were
used in the computation of these trees. To complement this, Principal Component
Analysis (PCA) (Jolliffe, 1986) was performed using HG frequencies. The eigenvector
associated with the largest eigenvalue gives the direction of the first principal
component, and the eigenvector associated with the second largest eigenvalue gives
the direction of the second principal component. The significant principal components
were identified using a scree plot (Cattell, 1966), which indicates the fraction of the
total variance in the data represented by each PC. PCA was computed using R version
2.11.0 statistical software (R Development Core Team, 2010). To assess and visualise
the similarities or dissimilarities among the study populations, Multidimensional
Scaling (MDS) (Kruskal, 1964) was computed based on Rst distances in R version 2.11.0
statistical software.

The ages of YSNP lineages in a population were calculated using the Average Square
Difference (ASD) method as described by Sengupta et al. (2006). The squared
differences in repeat number between each current Y chromosome and the founder
haplotype were averaged over chromosomes and then over loci. The standard error was
computed over loci. The ASD value was divided by w, where w is the average Y-STR
mutation rate of 0.00069 per locus per 25 years (Zhivotovsky et al., 2004). The age
was expressed in kilo years (Kya). Haplotypes from populations with sample sizes of
5 and above for a given haplogroup were selected for ASD estimates.
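
The ASD calculation can be sketched as follows. This is a minimal illustration under
stated assumptions: the function name and the three-locus haplotypes are hypothetical,
the founder haplotype is supplied by the caller, and the rate of 0.00069 is read as
per locus per 25-year generation, so that ASD / w gives generations.

```python
def asd_age_years(haplotypes, founder, mutation_rate=0.00069, generation_years=25):
    """Return (ASD, age in years) for a set of Y-STR haplotypes.

    Each haplotype is a tuple of repeat counts, in the same locus order as the
    founder haplotype.
    """
    n_loci = len(founder)
    per_locus = []
    for locus in range(n_loci):
        squared = [(h[locus] - founder[locus]) ** 2 for h in haplotypes]
        per_locus.append(sum(squared) / len(squared))  # average over chromosomes
    asd = sum(per_locus) / n_loci                       # average over loci
    generations = asd / mutation_rate
    return asd, generations * generation_years


# Hypothetical three-locus example.
founder = (14, 12, 23)
observed = [(14, 12, 23), (15, 12, 23), (14, 13, 23), (14, 12, 25)]
asd, age_years = asd_age_years(observed, founder)
print(f"ASD = {asd:.3f}, age = {age_years / 1000:.1f} Kya")
```
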

Phylogenetic networks and mismatch distributions were computed in Network
software version 4.6.10 (Fluxus Technology Ltd, 2012). Reduced Median (RM)
networks were plotted with a reduction threshold of 1 (Bandelt et al., 1999; Forster
et al., 2000). Each STR locus was weighted inversely to its variance: the weights
assigned to variances of 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8 and above 0.8 were 10, 8,
6, 4 and 2 respectively. The input files were prepared using the programme designed
by M/s Chella Softwares, Madurai. Mismatch distributions were also computed using the
Network software for the STRs belonging to the specific haplogroup under study. This
analysis determines whether the observed variance in the populations is an effect of
a demographic event (Slatkin and Hudson, 1991).
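
The variance-based weighting scheme can be written as a small lookup. This sketch is
not the input-preparation programme mentioned above. It makes two assumptions that the
text does not specify: a variance falling exactly on a bin boundary is placed in the
lower bin, and the population variance (rather than the sample variance) is used. The
function names and example haplotypes are hypothetical.

```python
from statistics import pvariance


def network_weight(variance):
    """Map a locus variance to its network weight (higher variance, lower weight)."""
    if variance < 0:
        raise ValueError("variance cannot be negative")
    if variance <= 0.2:
        return 10
    if variance <= 0.4:
        return 8
    if variance <= 0.6:
        return 6
    if variance <= 0.8:
        return 4
    return 2


def locus_weights(haplotypes):
    """Compute one weight per locus from the repeat-count variance at that locus."""
    n_loci = len(haplotypes[0])
    return [
        network_weight(pvariance([h[i] for h in haplotypes]))
        for i in range(n_loci)
    ]


# Hypothetical three-locus haplotypes.
example = [(14, 12, 23), (15, 12, 23), (14, 13, 21), (14, 12, 25)]
weights = locus_weights(example)
print(weights)
```
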

The presence of ancient haplotypes in the populations was assessed using the
Sum of Squared Distances (SSD) from the median haplotype for that HG. This method
assumes the median haplotype to be the founder haplotype (Sengupta et al., 2006).
Smaller SSDs represent older haplotypes, while larger SSDs represent more recent
haplotypes.
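
A sketch of the SSD calculation is given below. It assumes the median haplotype is
built locus by locus, and it uses the lower median so that the founder always carries
whole repeat counts; both choices are assumptions, as are the function names and the
example haplotypes.

```python
from statistics import median_low


def median_haplotype(haplotypes):
    """Build the assumed founder from the per-locus (lower) median repeat count."""
    n_loci = len(haplotypes[0])
    return tuple(median_low([h[i] for h in haplotypes]) for i in range(n_loci))


def ssd_from_median(haplotypes):
    """Sum of squared repeat differences of each haplotype from the median one."""
    founder = median_haplotype(haplotypes)
    return [
        sum((allele - ref) ** 2 for allele, ref in zip(h, founder))
        for h in haplotypes
    ]


# Hypothetical three-locus haplotypes within one haplogroup.
example = [(14, 12, 23), (15, 12, 23), (14, 13, 21), (14, 12, 25), (14, 12, 23)]
founder = median_haplotype(example)
ssds = ssd_from_median(example)
print(founder, ssds)
```
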

Coalescent methods implemented in BATWING (Wilson et al., 2003) were
applied to compute the split times of the populations under investigation. This
software assumes no gene flow among the populations. However, it has shown
remarkable sensitivity to gene flow between populations with paternal lineages from
the same HGs and lower sensitivity to immigrants bringing newer HGs into the parent
population (ArunKumar et al., 2012; Haber et al., 2012). The priors used for
determining the split times and constructing the phylogenetic tree were as follows:

• The Mig model was set to 1 when samples were assumed to be drawn from
  sub-populations, and to 0 when no population sub-division was considered.

• The Size model was set to 2, i.e., the populations remained constant and then
  expanded.

• The mutation rates were based on Zhivotovsky’s evolutionary rate, i.e.,
  0.00069/site/generation (Xue et al., 2006). The prior for population size
  was based on the ancestral population size during the Pleistocene, i.e.,
  10,000 (Harpending et al., 1998).

• The growth rate (alpha) was set to 0.005. The number of generations before which
  the population growth starts (beta) was set to 2.

• The number of Markov Chain cycles was set to 1.5 million.


The post-processing of BATWING data was performed in the R v2.11.0
statistical package. The first 0.5 million samples were discarded as burn-in. The
split times, TMRCA, total effective population sizes and population expansion times
were determined with 95% confidence intervals. The phylogenetic tree was plotted using
Dendroscope (Huson et al., 2007).
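
The post-processing step amounts to discarding the burn-in portion of each sampled
chain and summarising what remains. The original work used R; the sketch below uses
Python purely for illustration and is not the script used in this study. The function
name, the linear-interpolation quantile and the toy chain are assumptions.

```python
def interval_summary(samples, burn_in, level=0.95):
    """Return (median, lower, upper) of a chain after discarding the burn-in."""
    kept = sorted(samples[burn_in:])
    if not kept:
        raise ValueError("no samples left after burn-in")

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(kept) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(kept) - 1)
        frac = pos - lower
        return kept[lower] * (1 - frac) + kept[upper] * frac

    tail = (1 - level) / 2
    return quantile(0.5), quantile(tail), quantile(1 - tail)


# Toy chain: ten burn-in values followed by 101 retained values (0 to 100).
chain = [999.0] * 10 + [float(v) for v in range(101)]
median, low, high = interval_summary(chain, burn_in=10)
print(median, low, high)
```
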

Structure v2.2 software was employed to detect the underlying genetic
structure among a set of individuals using YSTR data. This is a Bayesian model-based
clustering method that assigns individuals to hypothetical ancestral
populations. It assumes a model in which there are K ancestral populations (where K
may be unknown), each of which is characterized by a set of allele frequencies at each
locus. Individuals in the sample are assigned (probabilistically) to populations, or
jointly to two or more populations if their genotypes indicate that they are admixed.
It computes the proportion of the genome of an individual originating from each
inferred population (quantitative clustering method). The number of MCMC cycles was
set to 10,000 after a burn-in length of 100,000. Several runs with different values of
K were performed. The run with the maximum likelihood for a given K was considered to
have captured the best structure from the data (Pritchard et al., 2000).


4. RESULTS 


4.1 The logic of studying populations from Deccan and Gujarat 

The Deccan is one of the oldest geophysical regions of the world. It is made up of
multiple layers of solidified flood basalt and step-like hills, which form the
landscape of this region. At present it is bounded by the Vindhyas on the north, and by
the sea on three sides: the Arabian Sea on the west, the Indian Ocean on the south and
the Bay of Bengal on the east (Britannica Online Encyclopedia, 2012). Currently the
Deccan region consists of the four Dravidian states (viz. Tamil Nadu, Kerala,
Karnataka and Andhra Pradesh), Maharashtra, southern Madhya Pradesh and Orissa. Its
topography, geology and optimal climate, with two monsoons (southwest and northeast),
have supported a variety of species of plants and animals. The Western Ghats, on the
western side of the Deccan plateau, possess a very high diversity of flora and fauna.
Archaeological evidence from the Deccan, especially Karnataka, Jeruru Valley,
supports an early inhabitation of this region by Man during the Palaeolithic
(Petraglia et al., 1998).

Genetic studies based on the NRY have described the first coastal migration
from Africa to Australia through this region (Wells et al., 2001). The laboratory at
Madurai contributed significantly to this discovery, and subsequent studies have
shown a rich genomic diversity and cultural heritage in Tamil Nadu and Kerala, the
two southernmost states of Deccan India. Following this first coastal migration, an
early settlement in the Western Ghats has been suggested based on NRY evidence from
this laboratory (Kavitha, 2008; Arunkumar et al., 2012). Apart from its early
inhabitation, the Deccan has been subject to clear population differentiation,
developing a caste/tribe-specific distribution of NRY lineages, in contrast to
mtDNA (Thangaraj et al., 1999; Bamshad, 2001; Cordaux et al., 2003). The ancient
inhabitation of the southern Deccan has given rise to a few NRY
haplogroups (L1-M27, H1-M52) (Sengupta et al., 2006).

From the linguistic point of view, the people of the Deccan speak languages
belonging to the Dravidian linguistic family. The majority at present speak one or
another of the South Dravidian languages, which presumably originated from a
common root, proto-Dravidian (Renfrew, 1996). Nonetheless, the Central Dravidian
speakers, the majority of them being the Gond tribe, are distributed in large numbers
(~3 million), ranging from eastern Maharashtra and Madhya Pradesh to Orissa. It has
been previously described from our lab that these Central Dravidian speakers are
genetically closer to the geographically proximal Austro-Asiatic speakers than to the
linguistically proximal South Dravidian ones. They have shown evidence of
expansion unconnected with the populations of the southern Deccan, particularly Tamil
Nadu (ArunKumar, 2012).

The NRY genetic structure of Tamil Nadu was laid down during the Palaeolithic
period, well before the Sangam epoch of the Tamil classics dating to ~200 BC and the
introduction of the Varna system (ArunKumar et al., 2012). The Dravidian kinship
system, as well as the languages, are thus unique to the Deccan and highly evolved
(Trautmann, 1981). Studies on Kerala populations, except select tribes of the Western
Ghats, have shown a levelled presence of different NRY lineages, which was attributed
to various factors such as Roman connections, seafarers and social movements (Kavitha,
2008). In the light of this scenario, it was of interest to study the NRY profile of
the other linguistic states of the Deccan, since a language co-evolves with culture
and gene pool. Thus I studied 2,522 samples from 33 castes and 18 tribes, from four
states, viz. Karnataka, Andhra Pradesh, Maharashtra and also Gujarat (Table 2), which
lies on the coastal route of early population movements into India. All these states
have many exotic tribes and well-differentiated castes. I thus tested whether the
caste-tribe divide, language, geography or other social characteristics explained the
genetic diversity observed among the populations of the Deccan and Gujarat. The
fidelity of NRY markers was appreciable and conclusive, and this was made possible by
the sampling strategy and the genomic techniques employed in The Genographic Project.

In the following chapters of results, the data and analyses are presented state-wise,
so that it is easier to define the various issues in each state. The results are then
interpreted in the light of other data available from our studies on other states and
from other studies, thus drawing a holistic picture of the peopling of Deccan India
through NRY HG and STR markers. The fidelity of these markers to populations, language
and vocation seems to have been determined quite early in the evolution or
settling of these populations in India.

4.1.1 Gujarat — The coastal gateway to India 

A total of 413 samples belonging to 7 castes and 6 tribes residing in various
regions of the state (Kutch, Sourashtra and the regions adjoining the Narmada) were
collected and studied (Fig 5). The Y chromosomal data were analysed to understand
the genetic composition and its relationship to geography, subsistence, language and
social ranking. The ethnographic details of the studied populations are presented in
Appendix 1a.


4.1.1.1 NRY Haplogroup frequency distribution in Gujarat study populations: 


Table 3 presents the list of populations studied and their NRY HG percentage
frequencies. When all the study populations of Gujarat were considered together,
NRY HGs R1a1a-M17 (21.7%), H2-Apt (19.8%) and H1a*-M82 (17.6%)
were present at the highest frequencies, accounting for 59% of the total NRY diversity.
When individual populations were considered, HG R1a1a-M17 was the most
frequent HG in many of the caste populations studied, and was the only HG observed in
Rajput (100%). Brahmin Kutchi (75%, Fisher Exact Test (FET) = 6.E-06) and Brahmin
Sompuri (71.4%, FET = 9.E-07) showed significantly high frequencies of this HG. HG
H1a*-M82 was found at its highest frequency among Koli (31.6%, FET = 1.E-01), a
caste population subsisting on fishing. It was interesting to note that Siddis, a
population of recent African descent, showed the highest and most significant
proportion of HG CR-M168 (54.1%, FET = 6.E-23), in agreement with previous reports
(Ramana et al., 2001). The HG H-M69 clades were predominant in the tribal populations
of the Narmada valley (Fig 9). Thus the populations of Gujarat showed a wide spectrum
of paternal lineages, indicating their complex histories.

On further analysis, the NRY HG composition showed an interesting pattern
of distribution when the populations were divided into castes and tribes. Nei’s gene
diversity of the HG distribution showed that the tribal populations of Gujarat were
more diverse (0.8502 +/- 0.0136) than the caste populations (0.7197 +/- 0.0353) (p
value <0.0001). The pie diagrams depicting the NRY HG composition among the
castes and tribes in the various geographical regions of Gujarat further delineated the
geographical barriers in the distribution of these castes and tribes, and this was
reflected in the NRY. The Narmada valley tribes (Kathodia, Kotwalia, Vasava and Ratwa)
showed higher proportions of H1a*-M82 (24.4%, FET = 2.E-04) and H2-Apt (36.4%,
FET = 4.E-21), and the Koli (fishermen) from the Sourashtra region showed higher
proportions of H1a*-M82, while the Maldhari tribe showed J2a*-M410 (58.3%, FET =
4.21E-07) and L1-M27/76 (25%, FET = 6.E-03), the latter being most frequent in
Southern India, particularly Tamil Nadu. In contrast to the HG H-M69 clades, HG
R1a1a-M17 was present in almost all the populations, with the highest proportions
(~70%) in the Brahmin populations and in Gatvi, an agricultural population, followed
by Jains (57.9%).

4.1.1.2 Neighbour Joining tree:


The evolutionary history of these study populations was inferred using the
Neighbour-Joining method. The optimal tree was computed based on Fst distances
from HG data (Fig 10a) and Rst distances from STR data (Fig 10b). The caste and
tribal populations formed two different clusters in both the trees; however, the caste
population Patel and the fishermen population Koli clustered with the tribal
populations of Gujarat. Further, all the caste populations except Patel, having higher
frequencies of HG H1a*-M82 and HG J2a*-M410, clustered together in the second
arm of the NJ tree. The Siddis stood apart in the tree but clustered distantly with the
tribal arm.


4.1.1.3 Analysis of Molecular Variance (AMOVA): 


To test whether the populations show any genetic differentiation when
grouped based on language, geography or other social parameters, AMOVA was
performed. Fsc indicates “among populations within groups” variance, Fct
describes “variation among groups” and Fst indicates “among populations” variance.
If the grouping reflects genuine genetic differentiation among the population groups,
the Fsc value should be lower than the Fct value. When the study populations were
grouped as castes and tribes, Fsc was higher than Fct for both YSNPs and
YSTRs. However, when Patel and Koli were removed and AMOVA was
recomputed, the Fct value (0.248) was 1.7 times the Fsc value (0.140)
(Table 5).

Table 6 shows the Fst distance matrix as a heat map. It can be seen that
the Fst and Rst distances between caste populations and tribal populations were high.
However, Rajputs were very distinct from the other populations of Gujarat.

Figure 10a: NJ tree based on NRY HG Fst distances for Gujarat study
populations

Figure 10b: NJ tree based on NRY STR Rst distances for Gujarat study
populations
Note: The numbers within the brackets indicate the sample size (N) studied for each
population. The branch lengths indicate the genetic distances between the internal
nodes. The pattern of clustering was similar in both the trees, showing a distinction
between caste and tribe. Koli and Patel cluster closer to the tribal populations.

4.1.1.4 Principal Component Analysis (PCA):


PCA based on the NRY HG frequencies was computed to determine the
genetic relationships among the populations (Fig 11a, 11b). Component 1 and
Component 2 contributed 57.6% and 17.3% of the variance respectively, and were
influenced by HG R1a1a-M17 and HG J2a*-M410 respectively. The tribal populations,
which were distinguished from the rest, were characterized by the NRY HG H1a*-M82
vector. Patel and Koli populations clustered along with these tribal populations
irrespective of their caste hierarchy.

4.1.1.5 Multidimensional Scaling (MDS):

MDS was computed from YSTR-based Rst distances to evaluate the genetic
differences among the populations (Fig 12). The analysis gave a stress value of 6.156.
The MDS plot reflected the clustering seen in the PCA, though in MDS the populations
were widely distributed, which can be attributed to isolated YSTR evolution within each
population. Nonetheless, the plot showed a good distinction between the caste and
tribal populations studied.

4.1.1.6 Phylogenetic Network Analysis:

The reduced median phylogenetic networks were computed from YSTRs at 
the background of each haplogroup to infer evolutionary relationships among various 
populations within that HG. Only the Figs of networks that were highly informative 
are presented. The interpretations drawn from each of the HG are presented below: 
NRY HG CR-M168: The network was highly reticulated and present only Siddi 
populations. 

NRY HG C5-M356: This HG was represented only in Patel, Kotwalia and Vasava
populations. The network showed reticulations with no central node and long branches
with several unoccupied steps, indicating diverse sources/long term evolution of these
haplotypes.

Figure 11a: Principal Component Analysis of NRY HG frequencies of study populations
from Gujarat
Note: The tribal populations are indicated as red squares whereas the caste populations
are indicated by yellow circles. The biplot shows the contribution of each haplogroup,
represented by lines as component loading vectors. The percentage variance
contributed by PC1 and PC2 is shown in the Scree plot. The PCA plot showed a
distinction between caste and tribal populations.

Figure 12: Multi Dimensional Scaling of NRY STR-Rst distances for populations of
Gujarat
Note: The tribal populations are indicated as red squares whereas the caste populations are
indicated by yellow circles. The MDS plot showed a distinction between caste and tribal
populations.

NRY HG H1a*-M82: This HG was identified mainly in the tribal populations, with
long branches and multiple unoccupied steps, indicating the possibility of drift or
multiple distant sources of this HG among these study populations (Fig 13a). YSTR
evolution was detected in Kathodia, as evidenced by single step mutations. Minimal
haplotype sharing was observed, suggesting no recent gene flow among the
populations.

NRY HG H2-Apt: This HG was also identified mainly in tribal populations.
Haplotype sharing between Kathodia and Kotwalia was observed, along with stepwise
STR evolution. Vasava showed evolution from this Kathodia-Kotwalia cluster,
suggesting a common founder group. Many terminal branches were observed in the
network, suggesting diverse sources and/or recent in-migration of this HG in various
populations (Fig 13b).

NRY HG J2a-M410: This HG was seen mainly in Patel and Maldhari populations. The
Maldharis (N=7) showed a unique signature, with the same haplotype at all 17 loci in all
the samples. The YSTRs of Patel were more diverse, as shown by long branches,
indicating diverse sources for these haplotypes.

HG Q1a3-M346: Kotwalia populations showed stepwise mutations at the periphery of
the branches, indicating recent evolution of this HG among Kotwalia.

HG R1a1a-M17: Fig 13c showed two distinct clusters among the Gujarat populations
without a central median haplotype. In cluster 1, population-specific clusters were
identified among the Brahmins and Gatvi. Population-specific YSTR evolution was
observed among Gatvi. Cluster 2 comprised mainly Rajputs, Patels and Jains.
These clusters indicate unique YSTR differentiation among these populations within
HG R1a1a-M17. The tribal populations showed long branches, indicating distant
sources for these haplotypes. The caste populations Koli and Patel were present
sporadically within the phylogenetic network. This network clearly differentiates two
sources of R1a1a-M17 among the Gujarati populations.

Figure 13a-13c: Reduced median phylogenetic network analysis of Gujarat study populations
Figure 13a: NRY HG H1a*-M82
Figure 13b: NRY HG H2-Apt

HG R2-M124: This HG was sporadically represented in the study populations with no
population-specific cluster. It was characterised by long branches with multiple
unoccupied steps, again indicating genetic drift or distant sources for these
haplotypes.

4.1.1.7 Mismatch distributions:

Mismatch distribution analysis of YSTR data was performed to obtain the
molecular distances within a haplogroup for all the populations (Fig 14a-14h). It
determines the molecular proximity of a haplotype in a given population. The
mismatch distribution plots of HGs CR-M168 and C5-M356 were characterised by
multimodal peaks with high Mean Pairwise Difference (MPD) values. The networks of
these HGs showed heavy reticulations, suggesting the possibility of long term drift.
On the other hand, although the mismatch distribution plots of J2a*-M410 and
Q1a3-M357 showed multimodal peaks, the YSTR networks did not show reticulations,
indicating multiple sources of YSTRs for these haplogroups or in-migrations. HG
H2-Apt showed a distinct bimodal peak, indicating at least two different sources for
this HG. HGs H1a*-M82 and R1a1a-M17 showed a unimodal peak, indicating a single
source for these haplogroups, with lower MPD values indicating recent evolution of
these haplogroups in the study populations. The network of R1a1a-M17 showed two
distinct clusters while only a single unimodal peak was observed in the mismatch
distribution. This could be explained by the fact that the two clusters in the network
were only a single step apart.
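
The way a mismatch distribution and its MPD are obtained from a set of YSTR haplotypes can be sketched as follows. This is a minimal illustration, assuming that the difference between two haplotypes is counted as the number of loci at which they differ; the software used in the study may instead sum repeat-count differences. The function names and the four-locus haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(h1, h2):
    """Number of loci at which two haplotypes carry different repeat counts."""
    return sum(1 for a, b in zip(h1, h2) if a != b)


def mismatch_distribution(haplotypes):
    """Return (distribution, mpd) over all pairs of haplotypes.

    distribution maps each difference count to its relative frequency;
    mpd is the mean pairwise difference.
    """
    diffs = [pairwise_differences(a, b) for a, b in combinations(haplotypes, 2)]
    if not diffs:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(diffs)
    total = len(diffs)
    distribution = {k: counts[k] / total for k in sorted(counts)}
    mpd = sum(diffs) / total
    return distribution, mpd


# Hypothetical haplotypes typed at four YSTR loci (not taken from the study)
haps = [(14, 12, 30, 24), (14, 12, 30, 24), (14, 13, 30, 24), (15, 13, 31, 24)]
dist, mpd = mismatch_distribution(haps)
```

Plotting the distribution against the difference count gives the curve whose number of peaks is discussed above; a single peak at a low difference count corresponds to a low MPD.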
Figure 14a-14h: Mismatch distribution based on YSTRs within each
haplogroup for study populations from Gujarat

4.1.1.8 BATWING Analysis:

A set of three BATWING runs was performed to estimate the divergence
times of the populations studied. The first BATWING run included all the Gujarati
populations, and a coalescent tree was obtained (Fig 15). The tree showed distinct
clustering of castes and tribes with a coalescent time of ~6 Kya. Maldhari and Ratwa
showed a recent split at 977 Ybp. Similarly, Kathodia and Kotwalia showed a
recent split at 1.1 Kya. These four populations shared a common ancestor around
2 Kya. Vasava had a common ancestor with these ~3 Kya. Brahmin Kutchi and Rajputs
showed a recent split at 1.5 Kya. These populations coalesce with Brahmin Sompuri
around 3 Kya. Jains and Patel showed a split time of 5.7 Kya. The TMRCA for all
Gujarat populations studied was 52,252 Ybp (95% CI: 47,875-71,675), which
overlapped with the expansion time for all the lineages (47,195 Ybp; 95% CI:
45,037-82,134), indicating a long term expansion of these populations in this region
(Table 7, 8).

The ancestral effective population size (Na) was estimated to be 1,977 (95%
CI: 1,796-2,036) for all the Gujarat populations. The second set of BATWING
simulations comprised two independent runs, one for (i) the caste populations and
one for (ii) the tribal populations of Gujarat, to estimate TMRCA and Na. The total
Na of the caste and tribal populations was found to be 2,273 and 998 respectively.
The TMRCA of the caste and tribal populations for all lineages was found to be
53,173 and 60,417 years respectively, suggesting that the tribal populations are
more ancient than the caste populations.

To determine whether the ages of the HGs present in each population reflected similar
time frames, a third set of BATWING simulations, one for each population, was set up
and the TMRCA of each HG in every population was computed (Table 9). Though
Koli and Patels clustered with tribes in the previous analyses, the HG age estimates were
appreciably different among them. The HGs C5-M356, H1a*-M82, J2a*-M410,
R1a1a-M17 and R2-M124 showed similar ages among Patels and tribals. Similarly,
the HGs H1a*-M82, H2-Apt, J2a*-M410, L1-M27/76, R1a1a-M17 and R2-M124
showed similar ages between Koli and the tribes. This further supports the view that
Koli and Patel are closer to the tribal populations, although they are not similar to
each other.

Table 7: BATWING estimates of ancestral effective population size (Na) of the various study states
Note: The ancestral effective population size has to be viewed with caution, as populations
belonging to different study states have been used in different BATWING simulations.

Another interesting observation was that the ages of the various HGs varied far more
within caste populations than within tribes. For example, in Brahmin Sompuri the
HG ages range from 17.4 Kya to 26.9 Kya and in Gatvi from 4.9 Kya to
47.8 Kya, while in tribes the range of the age estimates was much smaller, for example
Ratwa (21.1 Kya to 35.5 Kya) and Maldhari (33.5 Kya to 37.3 Kya). This suggests that
although caste populations have a lower Nei's gene diversity, a result of the lower number
of HGs present, the histories of the HGs in the caste populations are markedly different
from each other, which may be due to multiple events of gene assimilation. On the
other hand, in the tribal populations, although the Nei's gene diversity (Table 3) is higher,
the ages of the HGs are more uniform; this may be because each of the
Gujarat tribes evolved from a diverse ancestral gene pool and has not received much gene
flow in the recent past.

Overall, the populations of Gujarat showed clear genetic variation in relation to the
caste and tribe divide, and the caste populations were found to have more complex
histories than the tribes.

4.1.2 Maharashtra - Interface of North & South

A total of 458 individuals from the seven tribal and six caste populations of
Maharashtra were genotyped for Y HGs and STRs (Table 2). The sampling locations
are shown in Fig 5. Populations from the Western Ghats and adjoining regions,
Marathwada and interior Gondwana land were sampled. The ethnographic details of
the studied populations are presented in Appendix 1b.


4.1.2.1 NRY Haplogroup frequency distribution in Maharashtra:

The study samples from Maharashtra, totalling 458, showed appreciable
frequencies of NRY HG R1a1a-M17 (20.3%), HG H1a*-M82 (17.9%) and HG R2-M124
(11.1%), followed by HG J2a-M410 (9.4%), HG O2a-M95 (7.9%) and L1 (6.8%),
together accounting for about 73% of the total gene pool (Table 3, 4b and Fig 16).
Nonetheless, at the population level, Dhangars, a pastoral population, had the highest
proportion of R1a1a-M17 (45%, Fisher's exact test (FET) 3.E-04), followed by Maratha
(44.2%, FET 2.E-04), Deshastha Brahmin (41%, FET 7.E-02) and Chitpavan Brahmin
(35.7%, FET 5.E-02). The highest frequencies of HG H1a*-M82 were seen in Raj Gonds
and Gonds (75% and 70.3%, FET 2.E-13 and 2.E-05 respectively). Mang, an artisan
population, showed a high frequency of HG R2-M124 (31.8%, FET 6.E-03). Parsee, a
migrant population from the Zoroastrian region of Iran and a highly endogamous
population residing in and around Mumbai, showed a significant proportion of HG
J2a*-M410 (33.7%, FET 1.E-13). Korku, an Austro Asiatic (AA) language speaking
population, had a very high proportion of O2a-M95 (75%, FET 2.E-32), the commonest
haplogroup of AA speakers in Orissa and northeast India. Kolam, a Dravidian speaking
tribe in west India, showed equal proportions of HG O2a-M95, HG L1-M27/M76 and
R2-M124 (18.5%). The Nei's gene diversity of the caste populations was
0.8538 ± 0.0140, comparable to that of the tribes (0.8568 ± 0.0122).
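
The FET values quoted above test whether a haplogroup is over-represented in one population relative to the remaining samples. A minimal sketch of such a test on a 2x2 table is given below. It assumes a one-sided test for over-representation, which the text does not state, and the counts are hypothetical because the underlying counts are not given in the text.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: carriers of the haplogroup in the population of interest
    b: non-carriers in that population
    c: carriers in all other populations
    d: non-carriers in all other populations
    Returns the probability of seeing a or more carriers in the population
    of interest when the row and column totals are held fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / denom


# Hypothetical counts (not taken from the study)
p_value = fisher_exact_greater(9, 11, 20, 160)
```

A small p-value indicates that the haplogroup is more frequent in the population of interest than expected by chance.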

4.1.2.2 Neighbour Joining tree (NJ)

The evolutionary relationships of the populations were studied by NJ trees
based on Fst and Rst distances (Fig 17a, 17b). In the Fst-based NJ tree, the five caste
populations studied clustered together with minimal distances between them,
while the tribal populations clustered distinctly, with greater distances from each other.
Interestingly, the two Gond populations clustered together. A similar picture was
obtained with the Rst-based tree, except for the order of the caste populations in the
tree. Further, in both trees, Kolam, the Central Dravidian speaking tribe of
Maharashtra, was the closest to Warli and other tribes. Nonetheless, the two Gond
populations, the other Dravidian speaking populations, were quite distant. The AA
speaking Korku showed the greatest distance from Kolam and the other tribes studied.


4.1.2.3 AMOVA 


AMOVA values were obtained by grouping the study populations based on
various parameters. For YSNPs, the groupings based on three distinct geographical
regions (Sahyadri, Gondwana, Satpura ranges) and on languages gave higher Fct
(0.145, 0.116) and lower Fsc (0.104, 0.107) values (Table 10). For YSTRs, the
corresponding Fct values were 0.112 and 0.092, and the Fsc values were 0.059 and
0.054. Other groupings, based on the caste and tribe divide and on subsistence, did
not yield any appreciable differences.

The matrix of pair-wise Fst/Rst distances among the study populations from
Maharashtra is shown in Table 11. It was observed that the Central Dravidian (CDR)
populations such as Gonds and Kolam are also genetically distant from each other.
Korku, an AA speaking population, stands distinct in comparison to the others.


4.1.2.4 Principal Component Analysis:

The first two principal components revealed region-based clustering of the
populations (Fig 18a, 18b). PC1 and PC2 contributed 39.4% and 31.2% of the
variance respectively and were determined by the HG R1a1a-M17 and HG J2a*-M410
vectors respectively. The IE tribal and non-tribal populations were differentiated along
this vector. The third dimension was determined by the HG H1a*-M82 vector,
differentiating the CDR speaking Gonds and Raj Gonds from the others. HG O2a-M95 in
the AA speaking Korku formed the next dimension, with 7% of the variation. Thus,
overall, the populations clustered based on their language family.

Figure 17a: NJ tree based on NRY HG-Fst distances for Maharashtra study populations
Figure 17b: NJ tree based on YSTR-Rst distances for Maharashtra study populations
Note: The numbers within the brackets indicate the sample size (N) studied for each population.
The branch lengths indicate the genetic distances between internal nodes. Castes form a tight
cluster while the tribes form a loose cluster different from the castes.

Figure 18a: Principal Component Analysis of NRY HG frequencies of study populations
from Maharashtra
Figure 18b: Scree plot for PCA (Figure 18a) components
Note: The populations are coloured based on their language affinities. Squares indicate tribal
and circles indicate caste populations. The biplot shows the contribution of each haplogroup,
represented by lines as component loading vectors. The percentage variance contributed by
each principal component is represented in the scree plot.


4.1.2.5 Multidimensional Scaling

Non-metric MDS computed from the Rst matrix is presented in Fig 19. The stress
value was 7.86 and the R² value was 0.93. Similar to the PCA, the Gonds and the AA
speaking Korku were isolated, which could be attributed to their distinct YSTR
profiles separating them in different directions from the rest of the populations.

4.1.2.6 Phylogenetic networks:

NRY HG C5-M356: The haplotypes of this HG were mainly present in the IE speaking
population Warli. The network showed a central reticulation, long branches and
multiple unoccupied steps, indicating distant sources of the haplotypes among these
populations or loss of haplotypes by genetic drift.

NRY HG H1a*-M82: The Dravidian speaking Gond and Raj Gond were over-represented
in the network. The branches were strewn from a hypothetical central
node. Caste populations occurred sporadically. All these features indicated less
diverse sources and long term expansion among these populations (Fig 20a).

NRY HG H2-Apt: This HG was mainly localised in the IE speaking artisan group
Katkari. Dhangar, an IE speaking pastoral population, showed a population-specific
cluster. Long branches with single step mutations at the periphery of the network
indicated only recent YSTR-based evolution among these populations from
multiple/diverse sources.

Figure 19: Multi Dimensional Scaling of NRY STR-Rst distances for populations of
Maharashtra
Figure 20a-20e: Phylogenetic network analysis of Maharashtra study populations: HGs
H1a*, J2a*, L1, O2a, R1a1a
Figure 20a: NRY HG H1a*-M82
Legend: Brahmin Deshastha, Dhangar, Gond, Raj Gond, Katkari, Kokani, Kolam, Korku,
Mang, Maratha, Parsee, Warli
Figure 20b: NRY HG J2a*-M410

NRY HG J2a*-M410: The haplotypes were mainly present in the Parsee population.
However, the Brahmin populations also showed some population-specific clusters. The
presence of long branches and stepwise mutations at the periphery indicated recent
evolution of these haplotypes in the study populations (Fig 20b).

NRY HG L1-M27/76: The haplotypes of this HG were sporadically distributed
among the populations of Maharashtra. No population-specific clusters were found.
The central node was occupied by the Central Dravidian speaking population Kolam.
The haplotypes showed long radiating branches with multiple unoccupied steps,
indicating distant sources of these haplotypes (Fig 20c).

NRY HG O2a-M95: This HG was localised mainly among the AA speaking Korku.
This population showed single step YSTR evolution and also occupied the central
node. The YSTR haplotypes of this HG were not shared by other populations. All these
indicated long term isolation and evolution of this HG among Korku (Fig 20d).

NRY HG R1a1a-M17: This network showed two distinct clusters (Fig 20e). Cluster
1 was mainly composed of populations such as Brahmins, Dhangar, Marathas and
Parsees. Cluster 2 was mainly composed of Dhangar, Marathas and CDR speaking
tribal populations. The Katkaris shared all 17 YSTRs within the population,
indicating a recent single source of this HG in this population. Unlike cluster 1,
cluster 2 did not include any of the Brahmin populations, which could indicate a
different YSTR evolutionary pattern within this HG among the Brahmin populations.

NRY HG R2-M124: This HG was distributed mainly among the IE speakers of
Maharashtra. Parsee showed a population-specific cluster. Long radiating branches in
the network indicated genetic drift or distant sources for the YSTRs among the
populations.

Figure 20c: HG L1-M27/76
Figure 20d: HG O2a-M95

4.1.2.7 Mismatch distributions:

Mismatch distribution analysis was performed for all the YSTRs within a HG
for all the study populations (Fig 21a-21h). HG C5-M356 showed multimodal peaks
that could indicate multiple sources of this HG or loss of haplotypes due to genetic
drift. HG H1a*-M82 showed a unimodal peak and a high MPD value of 7.2. HG
R1a1a-M17 showed two peaks, one having a higher frequency than the other. This can
be correlated with the two clusters of R1a1a-M17 haplotypes in the network analysis.
This plot, with a high MPD of 8.9, indicated signatures of demographic expansion,
probably from two distinct sources. HG H2-Apt showed a bimodal peak with a high
MPD value (12.635), indicating at least two different sources for this HG. HGs
J2a*-M410, L1-M27/76, O2a-M95 and R2-M124 showed multimodal peaks, suggesting
multiple sources for these HGs among the study populations of Maharashtra.

4.1.2.8 BATWING Analysis and ASD estimates:

The phylogenetic tree computed using BATWING (Fig 22) showed three
distinct evolutionary groups. The clustering obtained by this analysis was similar to
that obtained from PCA and MDS (Fig 18, 19). Brahmin Deshastha and the warrior
Maratha showed a split time of 1,243 years. Brahmin Chitpavan had a recent split
from these populations 1,019 years ago, and the CDR speaking Kolam shared
ancestry with these populations ~4 Kya. The second branch comprised the IE
speaking tribes and the AA speaking tribe, which separated ~10 Kya. The DR speaking
Gond and Raj Gonds had a recent split 3,016 years ago. The Parsee, originally from
Iran, were joined to the common ancestor (11.5 Kya) of these two branches, showing
that they are distinct from Brahmins and all other populations of Maharashtra.

Figure 21a-21h: Mismatch Distribution plots based on YSTRs within a HG
for study populations from Maharashtra

The ancestral effective population size of the Maharashtra study populations
was found to be 6,310 (95% CI: 5,882-6,640). The TMRCA was 57,868 Ybp (95%
CI: 55,519-62,678), and the population expansion time was 47,921 Ybp (95% CI:
44,899-49,656). The TMRCA (Table 8) and ancestral effective population size of the
caste populations were 64,756 Ybp (95% CI: 42,198-1,08,638) and 1,565 (95% CI:
747-4,114) respectively, whereas the TMRCA and effective population size of the tribal
populations were 60,326 Ybp (95% CI: 3,09,088-1,01,847) and 1,855 (95% CI:
581-4,618). The TMRCA of the tribes overlapped with that of the castes.

The ASD age estimates calculated for each HG in every population gave an
interesting picture of the histories of the populations (Appendix 11). The ASD age for
HG H1a*-M82 was uniform among the populations, which, coupled with the diverse
clusters observed in the network (Fig 20a), suggests that the source of this HG among
the Mang, Gond, Katkari and Warli was unique. On the other hand, R1a1a-M17
showed diverse age estimates among the populations. The Brahmin populations
showed the highest age of 27.3 ± 11.4 Kya, followed by Dhangar (19.4 ± 3.5 Kya).
Though the network analysis (Fig 20e) showed two distinct clusters, most of the
samples, except those of Brahmins and Parsee, were distributed in both clusters. The
wide range of age estimates of R1a1a-M17 among these populations suggests diverse
sources of this HG among them.
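
The logic behind an ASD (commonly expanded as average squared distance) age estimate can be sketched as follows. This is an illustration only: it assumes the founder haplotype is known, a single stepwise mutation rate shared by all loci, and a fixed generation time. The mutation rate, the generation time and the haplotypes below are hypothetical and are not the values used in the study.

```python
def asd_age(haplotypes, founder, mutation_rate, generation_years):
    """Age of a lineage from the average squared distance to its founder.

    haplotypes: repeat counts per locus for each sampled chromosome
    founder: assumed ancestral (for example modal) haplotype
    mutation_rate: assumed mutations per locus per generation
    generation_years: assumed length of a generation in years
    """
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    n_loci = len(founder)
    total = 0.0
    for hap in haplotypes:
        total += sum((a - f) ** 2 for a, f in zip(hap, founder))
    asd = total / (len(haplotypes) * n_loci)
    generations = asd / mutation_rate
    return generations * generation_years


# Hypothetical two-locus data and parameters (not taken from the study)
sample = [(14, 12), (15, 12), (14, 13), (14, 12)]
age_years = asd_age(sample, founder=(14, 12), mutation_rate=0.0025, generation_years=25)
```

Because the estimate scales inversely with the assumed mutation rate, ages obtained this way are only comparable across populations when the same rate and founder assumptions are applied throughout.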

The statistical analysis of the Maharashtra populations thereby supports the view
that geography and language have co-influenced the genetic structuring of these
populations.

4.1.3 Karnataka - The land of mid-Western Ghats

The regions of sampling in Karnataka are shown in Fig 5. A total of 877
volunteers belonging to 3 tribes and 10 caste populations were studied for their YSNP
and YSTR polymorphisms (Table 2). Attention was paid to the South Canara region
because it adjoins the Nilgiris, both belonging to the moist deciduous forests of the
Western Ghats. North Karnataka was purposefully avoided because of later
invasions and expansions such as that of the Vijayanagar Empire. The ethnographic
details of the populations studied are given in Appendix 1d.
4.1.3.1 Y Haplogroup frequency distribution 

When all the samples from Karnataka were analyzed for NRY HGs, R1a1a-M17
and H1a*-M82 were present at frequencies > 20% (24.1% and 21.8%
respectively). HGs R2-M124, L1-M27/M76, F*-M89 and J2a*-M410
were present in the range of 5% to 10%, whereas HGs C5-M356, H*-M69, H2-Apt,
L3*-M357, J2a4c-M68, J2b-M221/M102 and Q1a3-M346 were present sporadically, in
the range of 1-3% (Table 3, 4c, Fig 23).

When individual populations were considered, F*-M89 was present at
frequencies of 2% to 23.4% in all populations except the Brahmins and a tribal
population, Koraga. HG H1a*-M82 was present in all the populations, with the highest
frequency in Koraga (89%, FET 2.E-38). HG J2a*-M410 was over-represented in
Iyengars (28.6%, FET 1.E-06). HG L1-M27/76 was also present in all the populations,
with a high frequency in the Yerava tribe (17.2%, FET 2.E-02), which subsists on
wet/dry land farming. HG L3*-M357 was found to be highest in Brahmin Havyaka
(14.8%, FET 2.E-06) among the study populations of Karnataka. HG R1a1a-M17 was
ubiquitous in all the populations, with the highest frequency in Brahmin Goud
Saraswath (67%, FET 4.E-21). Havyaka Brahmin also showed a high frequency of
R2-M124 (39.8%, FET 9.E-15).

When the genetic diversity of the tribal and caste populations was compared, the
caste populations possessed significantly higher diversity (0.8578 ± 0.0075)
than the tribes studied (0.7499 ± 0.0314), with a p value < 0.0001.
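
For reference, the sketch below shows how Nei's gene diversity is commonly computed from haplogroup counts, using the unbiased estimator h = n/(n-1) * (1 - sum of squared frequencies). It is assumed, not stated in the text, that the diversity values above were obtained with this estimator. The function name and the counts are hypothetical.

```python
def nei_gene_diversity(counts):
    """Unbiased Nei's gene diversity from haplogroup counts in one sample."""
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two chromosomes are required")
    sum_sq = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical haplogroup counts (not taken from the study)
even_sample = nei_gene_diversity([10, 10, 10, 10])   # four equally common HGs
skewed_sample = nei_gene_diversity([35, 3, 1, 1])    # one dominant HG
```

A sample dominated by a single haplogroup yields a lower value than one in which the same number of haplogroups is evenly represented, which is the pattern reported for a tribe such as Koraga, where one HG reaches 89%.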


4.1.3.2 Neighbour Joining Tree: 


The evolutionary histories of the study populations were inferred from NJ trees
based on NRY HG-Fst and YSTR-based Rst distances (Fig 24a, 24b). Brahmin
Havyaka was genetically distant from the cluster of other Brahmin populations. The
wet/dry land farming populations (Bunts, Mogaveera, Gowda) shared minimal distances
in both NJ trees. Koraga, an artisan tribal population, was isolated from the other
populations.


4.1.3.3 AMOVA 


The 13 study populations from Karnataka were grouped based on various
parameters such as the caste-tribe divide, geographical regions (Karavali, Malenadu and
South Bayaluseme), subsistence (Brahmins, wet/dry land farming, artisan and food
foragers), language families (Dravidian and Indo European) and language dialects
(Kannada, Tulu, Sanskrit, Kodava takk and Koraga), and the Analysis of Molecular
Variance was calculated (Table 12). It was observed that the variance among groups
(Fct) values were lower than the variance among populations within groups
(Fsc) values in all methods of grouping. When the variance among populations (Fst)
based on SNPs is greater than the Fst based on STRs, this can be attributed to
lineage-specific gene flow into the populations. In the study populations, the Fst
values of SNPs were found to be greater than the Fst values of YSTRs. The AMOVA
results suggested that genetic variation between populations grouped based on social
characteristics, language or geographical affiliations was not appreciable and was
confined to intra-population variation only. This observation is in contrast to that
observed in Gujarat and Maharashtra.

Figure 24a: NJ tree based on NRY HG-Fst distances for Karnataka study
populations
[Tree image not reproduced. Tip labels with sample sizes: Iyengar (42), Brahmin Goud
Saraswath (94), Kodava (50), Bunts (74), Mogaveera (88), Gowda (81), Adikarnataka (64),
Jenukuruba (26), Yerava (64), Brahmin Havyaka (88), Billava (58), Koraga (73), Kuruba (75)]
Figure 24b: NJ tree based on NRY STR-Rst distances for Karnataka study populations
Note: The numbers within the brackets indicate the sample size for each population.
The branch lengths indicate the genetic distances between internal nodes. No definite
grouping pattern based on language, caste-tribe divide, geography or other social
characters was observed. Only wet/dry land farming populations (Kodava, Bunts and
Gowda) form a cluster in both the Fst and Rst based NJ trees.

Table 13 shows the Fst and Rst based genetic distances. Koraga and
Jenukuruba were genetically distinct from many other populations, showing the highest
distances from Goud Saraswath, Iyengar and Kodava.


4.1.3.4 Principal Component Analysis 


Fig. 25a shows the PCA plot. PC1 explained 53.4% of the total
variance, mainly contributed by NRY HG R1a1a-M17. Brahmin populations such as
Goud Saraswath (BGS) and Iyengar (Iyn_K) were differentiated by this vector. PC2
contributed 20.9% (NRY HG H*-M69) and differentiated Jenu Kuruba, a tribal
population that subsists on honey collection. Yerava and Adikarnataka populations
were differentiated by the F*-M89 vector. The percentage variance contributed by each
principal component is shown in the Scree plot (Fig. 25b).

The non-metric MDS plot, computed from Rst distances, gave a stress value of
6.31 and R² = 0.95 (Fig. 26). The MDS plot was similar to the PCA in its
clustering pattern.


4.1.3.5 Phylogenetic networks:

NRY HG C5-M356: The haplotypes of this HG were seen mainly in Yerava, a tribal
population, and Brahmin Havyaka, with no central median node. Brahmin Havyaka
formed a population-specific cluster with no haplotype sharing with other populations.
The branches within this cluster showed mutations that were one to two steps apart.
The other populations showed long radiating branches with several unoccupied steps,
indicating distant sources of the haplotypes among them or loss of haplotypes by
genetic drift.

Figure 25a: Principal Component Analysis of NRY HG frequencies of study populations
from Karnataka
Figure 25b: Scree plot for PCA (Figure 25a) components
Note: In the PCA, squares represent tribal and circles represent caste populations. The
populations are coloured based on their mode of subsistence. The biplot shows the
directionality of the loading vectors. The percent variance contributed by each principal
component is given by the scree plot. It is to be noted that most of the wet/dry land
farmers are clustered together at the center of the PCA.

Figure 26: Multi Dimensional Scaling of NRY STR-Rst distances for populations of
Karnataka
Note: The plot showed the clustering of the wet land farmers at the center of the plot. The
Brahmin populations were diverse.


NRY HG F*-M89: Four populations showed distinct population-specific clusters:
Adikarnataka, Yerava, Jenukuruba and Gowda. Adikarnataka formed a distinct cluster
with single step mutations, thereby indicating YSTR evolution of this paragroup.
Yerava were represented by long radiating branches. Gowda showed long branches
with single step mutations towards the periphery. Jenu Kuruba, a foraging tribal
population, was also represented by long branches with multiple unoccupied steps.
Gene flow between Gowda and Adikarnataka populations was observed. All these
indicate that these populations experienced long term genetic drift, with traces
of recent evolution of these haplotypes in some populations and minimal recent gene
flow among them (Fig. 27a).

HG H*-M69: This network was characterised by three distinct clusters. One of them
was mainly occupied by the food-gathering Jenukuruba population, whereas
populations in the other two clusters were sporadic. The network showed reticulations
with no central node. These features indicated that these populations had different
sources of this HG and have experienced long term drift.

NRY HG H1a*-M82: This network was mainly represented by Koraga, an artisan
population, and the pastoral population Kuruba. The network clearly shows two
different directions in which the populations have expanded (Fig. 27b). Haplotype
sharing was observed between the Koraga and Iyengar populations, and also
among the Kodava, Billava, Adikarnataka and Koraga populations.
This indicated gene flow among, or origin from a common ancestor of, the populations
studied.

NRY HG H2-Apt: Brahmin Goud Saraswath showed a population-specific cluster and
YSTR evolution within it. The other cluster, with long radiating branches, included
sporadic representation of different populations, indicating diversification of YSTRs
among Brahmin Goud Saraswath and other populations of Karnataka.

Figure 27a-f: Phylogenetic network analysis of Karnataka study populations
Figure 27a: NRY HG F*-M89

NRY HG J2a*-M410: This network was found to have two distinct clusters, of
Brahmin Havyaka and Iyengar, with stepwise mutations. These indicate long term
isolation and evolution of these YSTRs in the above mentioned populations. Billava
also showed a population-specific cluster. No median haplotype was found and all the
samples were at the periphery (Fig. 27c).

NRY HG J2b-M221/102: Hypothetical central nodes with long branches were
noticed in this network. No population specific clusters were identified. The YSTRs
of these populations could have had diverse origins within HG J2b.

NRY HG L1-M27/76: The network was characterised by central median haplotype 
composed of Kuruba population. Long radial branches were strewn in all directions, 
radiating around the node with several unoccupied steps. Yerava formed a specific 
cluster. Haplotype sharing was observed. The lack of population specific clusters with 
limited evolution is suggestive of either gene flow among the populations or a recent 
origin from a diverse source (Fig. 27d). 

NRY HG L3*-M357: The haplotypes were over represented in Brahmin Havyaka,
which showed a distinct population specific cluster with single step mutations in the
branches, indicating long term YSTR evolution within this HG. Gowda (N=5) did not
show any YSTR evolution within this HG, which points to recent in-migration of this
HG into them.

NRY HG R1a1a-M17: The network had a hypothetical median central node, with the
branches radiating around this node. Brahmin Goudsaraswath was represented
throughout the network, indicating high YSTR diversity of HG R1a1a-M17 among
them. Iyengars also showed a population specific cluster. Minimal haplotype sharing
was also observed among different populations (Fig. 27e).

Figure 27c: NRY HG J2a*-M410
Figure 27d: NRY HG L1-M27/76
Figure 27e: NRY HG R1a1a-M17

NRY HG R2-M124: This network is characterised by central reticulations and long
radiating branches. Brahmin Havyaka and Yerava show population specific clusters
with single step mutations among them. Haplotype sharing was observed among IE
speaking Brahmin Havyaka and DR speaking Mogaveera. Kuruba showed distinct
evolution among the YSTRs within this HG (Fig. 27f).


4.1.3.6 Mismatch Distributions: 


Fig. 28a-28k show the mismatch distributions for each haplogroup. NRY
HGs F*-M89 (MPD: 13.117), H1a*-M82 (MPD: 6.7), L1 (MPD: 7.7) and R1a1a-M17
(MPD: 7.8) showed a clear unimodal peak and high MPD, indicating long term
demographic expansion. Although unimodal peaks were observed in all these HGs, the
highest MPD, in F*-M89, is suggestive of a longer period of isolated evolution of this
HG. HG H*-M69 (MPD: 13.45) showed a very high MPD value with multimodal peaks,
suggesting that this paragroup may contain lineages defined by as yet unidentified
markers. HG C5-M356 (MPD: 12.7) showed multimodal peaks, indicating either multiple
sources of YSTRs or loss of YSTRs by drift. HG J2b-M221/102 (MPD: 8.9) and L3*-M357
(MPD: 5.5) showed multiple peaks, indicating more than one source of YSTRs.
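
The mismatch distribution and MPD referred to above can be sketched as follows. This is only an illustration, assuming each YSTR haplotype is a tuple of repeat counts and that a "mismatch" is a locus at which two haplotypes differ; the thesis does not state which distance its software used, and the four-locus haplotypes below are invented.

```python
from collections import Counter
from itertools import combinations


def pairwise_mismatches(haplotypes):
    """Number of differing loci for every pair of haplotypes (assumed distance)."""
    return [
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    ]


def mismatch_distribution(haplotypes):
    """Relative frequency of each observed number of pairwise differences."""
    diffs = pairwise_mismatches(haplotypes)
    counts = Counter(diffs)
    return {k: counts[k] / len(diffs) for k in sorted(counts)}


def mean_pairwise_difference(haplotypes):
    """MPD: the mean of the mismatch distribution."""
    diffs = pairwise_mismatches(haplotypes)
    return sum(diffs) / len(diffs)


# Hypothetical haplotypes (four loci only, for illustration).
haps = [
    (14, 12, 22, 10),
    (14, 12, 23, 10),
    (15, 12, 23, 10),
    (14, 13, 22, 11),
]

print(mismatch_distribution(haps))
print(mean_pairwise_difference(haps))
```

A single smooth peak in the resulting histogram corresponds to the "unimodal" pattern described in the text, while several separated peaks correspond to the "multimodal" pattern.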


4.1.3.7 BATWING age estimates 


The BATWING phylogenetic tree of the 13 populations studied showed a
coalescence time of ~9 Kya (Fig. 29). The population split times were deeper for the
populations of Karnataka (~7-4 Kya) as compared to Gujarat and Maharashtra. This
indicated that the populations have been isolated for a longer time and are each
unique. However, Bunts and Billava had a recent split time of 2,241 Ybp. The pattern
of clustering in the phylogenetic tree computed by BATWING was different from that
of the NJ trees. The ancestral effective population size was found to be 21,727
(95% CI: 21,158-21,924) and the TMRCA was found to be 80,240 Ybp (95% CI:
79,486-90,809), whereas the population expansion time was 52,276 Ybp (95% CI:
51,342-54,771) (Table 8). It was interesting to note that the population expansion
time was much smaller than the TMRCA, with non-overlapping confidence intervals.

The overall results therefore suggest that the populations of Karnataka showed no
one to one correlation with language, geography or other social characters and have
been isolated for at least ~7 Ky.

Figure 28a-28k: Mismatch Distribution analysis based on YSTRs within a NRY HG for

Results 4.1.4: Godavari Delta of Coastal Andhra Pradesh - an ancient fertile
river-fed settlement of Deccan.

The Godavari delta of Andhra Pradesh was sampled for the study and the
areas sampled are shown in Fig. 1d. Ethnographic notes on the studied
populations are presented in Table 1d. Eleven caste and 2 tribal populations,
totalling 744 male volunteers, were studied for their NRY polymorphisms. All the
samples belonged to West Godavari, East Godavari and the coastal belts of northern
Andhra Pradesh. Below Orissa this is the most fertile alluvial belt with rice
cultivation, and in the context of Pataliputra and Kalinga this region assumes
significance in the trickling down of populations along the east coast; hence the
present study concentrated on this region (Table 1).


4.1.4.1 Y Haplogroup frequency distribution: 


NRY HG H1a*-M82, with an overall frequency of 26.7%, was the most frequent HG
observed in the total population of Andhra Pradesh. NRY HGs R1a1a-M17 and
L1-M27/76 were each present at a frequency of ~12%, whereas HGs F*-M89, J2a*-M410,
O2a-M95 and R2-M124 were each present at a frequency >5%. These HGs put together
contributed to 79.5% of the overall frequency amongst these populations (Table 3, 4d,
Fig. 30).

However, the frequency distribution varied very widely among the various castes
and tribes. HG H1a*-M82, though ubiquitous among all the populations, was over
represented in Relli (51.1%, FET 2E-07) followed by Mala (46.3%, FET 6E-05).
NRY HG R1a1a-M17 was present in high frequency in Brahmin Andhra-Neyogi and
Vaidiki (ANV) (55%, FET 4E-06) as compared to other populations. Yadava, Raju,
Kamma and Kapu showed ~20% of NRY HG L1-M27/76, but it was highly significant in
Kamma and Kapu (23.1% and 24.4%; FET 6E-04 and 2E-04 respectively). HG F*-M89 was
present in many populations in appreciable frequencies (~10%); Raju and Konda
Reddy, however, showed high F*-M89 (31%, FET 1E-07 and 21%, FET 9E-03). The two
tribal populations studied, Konda Reddy and Konda Kammara, showed a considerable
proportion of NRY HG O2a-M95 (27% and 32.6%; FET 5E-08 and 8E-05 respectively).
Interestingly, Dravida Brahmins showed 24.3% of NRY HG G-M201 (FET 2E-09)
and 27% of HG J2a*-M410 (FET 4E-06).
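
The FET values quoted above are Fisher's exact test p-values. A minimal sketch of such a test is given below, assuming a two-sided test on a 2x2 table of carriers and non-carriers of a haplogroup in one population versus all other samples. The thesis does not state the sidedness or the exact contrast used, and the counts in the example are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a, b: carriers / non-carriers in the population of interest
    c, d: carriers / non-carriers in all remaining samples
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Hypothetical counts: 23 of 45 carriers in one population, 176 of 699 elsewhere.
p_value = fisher_exact_two_sided(23, 22, 176, 523)
print(f"{p_value:.1e}")
```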

Nei's gene diversity was almost the same in caste and tribal populations
(0.8668±0.0073 and 0.8498±0.0237, respectively).
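
For reference, Nei's gene diversity can be computed from haplogroup assignments as sketched below, assuming the usual unbiased estimator h = n/(n-1) x (1 - sum of squared haplogroup frequencies). The sample in the example is invented and the standard errors reported above are not reproduced here.

```python
from collections import Counter


def nei_gene_diversity(haplogroups):
    """Unbiased Nei gene diversity from a list of haplogroup labels."""
    n = len(haplogroups)
    if n < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(haplogroups)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical sample of ten chromosomes.
sample = ["H1a*-M82"] * 5 + ["R1a1a-M17"] * 3 + ["L1-M27/76"] * 2
print(round(nei_gene_diversity(sample), 4))
```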
4.1.4.2 Neighbour Joining Tree 

NJ trees based on Fst and Rst distances represent the evolutionary
relationships among the populations (Fig. 31a, 31b). The SC populations (i.e., Mala,
Madiga, Jalari and Relli) and the tribal populations (i.e., Konda Reddy and Konda
Kammara) showed distinct clusters. The wet/dry land farming populations
(Kamma, Kapu, Yadava) clustered well based on both their Fst and Rst distance
matrices. Settibalija, another wet/dry land farming population, was placed at the upper
arm of the core cluster of Kamma, Kapu and Yadava. The two Brahmin populations
studied, Dravida and ANV, both from East Godavari, were quite distinct from each
other and clustered separately with wet/dry land farming populations in both the Fst
and Rst based trees. This was reflected in their HG composition as well, indicating
different paternal histories. Raju, a warrior class, stood distinct. The NJ trees obtained
based on YSNP and YSTR distances were similar.

Figure 31a: NJ tree based on NRY HG Fst distances for Andhra Pradesh study
populations
Figure 31b: NJ tree based on NRY STR Rst distances for Andhra Pradesh study
populations
Note: The numbers within brackets () indicate the sample size studied for each
population. The branch lengths indicate the genetic distance between internal nodes. In
both the trees, it is to be noted that the populations are divided into five groups:
1. Mala, Relli, Jalari and Madiga representing the SC populations
2. Konda Reddy and Konda Kammara representing the tribal groups
3. Raju, a warrior group of Andhra Pradesh
4. Yadava, Kapu and Kamma representing the wet/dry land farmers
5. The Brahmin related groups that are distant from each other

4.1.4.3 AMOVA

The AMOVA computations were performed based on various groupings such as
language families (IE and DR), caste-tribe divide, geography and subsistence
(Table 14). High Fct and low Fsc values were obtained when the populations were
grouped based on their subsistence for both YSNP and YSTR (Fct: 0.0572 and
0.0327; Fsc: 0.0236 and 0.0167 for YSNP and YSTR respectively). The Fst value
based on this grouping was also high (0.0795 and 0.0489 for YSNP and YSTR). Based on
subsistence, the populations were grouped as Brahmin related, wet/dry land farmers,
SC populations, food foragers and artisans. The among populations within groups
variance (Fsc) values were smaller than the variation among groups (Fct) when the
study populations were grouped based on caste-tribe divide, language and geography.
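
The three AMOVA fixation indices quoted for the subsistence grouping are linked by the standard hierarchical identity (1 - Fst) = (1 - Fsc)(1 - Fct). The short check below takes 0.0572 and 0.0327 as the among-groups component (Fct) and 0.0236 and 0.0167 as Fsc, and recovers the reported Fst values; the function name is introduced here for illustration only.

```python
def fst_from_components(fsc, fct):
    """Combine hierarchical fixation indices: (1 - Fst) = (1 - Fsc)(1 - Fct)."""
    return 1 - (1 - fsc) * (1 - fct)


ysnp_fst = fst_from_components(fsc=0.0236, fct=0.0572)
ystr_fst = fst_from_components(fsc=0.0167, fct=0.0327)
print(round(ysnp_fst, 4), round(ystr_fst, 4))  # 0.0795 0.0489
```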

Table 15 shows the pairwise Fst and Rst distances of the study
populations from Andhra Pradesh. It was observed that the genetic distance between
the two Brahmin populations, and between them and the other populations, was high.
Konda Kammara, a tribal population, is at minimal genetic distance from Konda
Reddy, another tribal population, and from Jalari, an SC community. The wet/dry land
farming populations (Yadava, Settibalija, Kamma and Kapu) share minimal Fst and Rst
distances in comparison to other populations.
4.1.4.4 Principal Component Analysis 

Fig. 32a shows the PCA plot coloured based on mode of subsistence. The
PC1 and PC2 components contribute 33.7% and 27.3% of the variance respectively. The
SC populations (Jalari, Relli, Mala and Madiga) were differentiated along the HG
H1a*-M82 vector. Konda Kammara, a tribal population, differentiates along HG O2a-M95.
The Brahmin ANV were differentiated by the R1a1a-M17 vector. The Brahmin Dravida and
the warrior group Raju are differentiated by the HG G-M201 and F*-M89 vectors
respectively. The major wet/dry land farming groups (Kapu and Kamma) are
differentiated along the HG L1-M27/76 vector. Fig. 32b shows the scree plot
representing the variance contributed by each PC component.


Table 15: Pairwise distance matrix based on NRY HG Fst and YSTR Rst distances for
Andhra Pradesh study populations
Note: The lower triangle represents the NRY HG Fst matrix and the upper triangle
represents the YSTR Rst matrix.


Figure 32a: Principal Component Analysis based on YHG frequencies of study
populations from Andhra Pradesh
Figure 32b: Scree plot for PCA (Figure 32a) components
Note: The tribal populations are indicated by squares and circles represent the caste
populations. The percent variance contributed by each PC is given by the scree plot. The
biplot shows the variance contributed by each HG, represented by lines as component
loading vectors. It is to be noted that the study populations of Andhra Pradesh are
clustered based on their mode of subsistence.


4.1.4.5 Multidimensional scaling: 


Fig. 33 shows the MDS plot, which is coloured based on mode of
subsistence. The stress value obtained for three dimensional NM MDS was 10.61911.
This plot showed a clear distinction of wet/dry land farming groups, SC populations,
Brahmins, warrior and tribal related groups. The clustering pattern for wet/dry land
farming populations was similar to that of the PCA. SC populations such as Relli and
Jalari were distant from the other SC populations, Mala and Madiga. This showed
that these populations had similar NRY HG profiles but distinct YSTR signatures, a
scenario that could arise when populations undergo long term isolation and YSTR
evolution, or through admixture of migrant groups with similar HGs but different YSTR
signatures. The Brahmin and Raju populations were more distant from each other and
from other populations than in the PCA.
4.1.4.6 Phylogenetic Networks 
NRY HG C5-M356: This HG was sporadically present in the study populations. No
population specific clusters were found. The network was characterised by long
radiating branches, indicating a distant source and limited YSTR evolution, or loss of
haplotypes due to drift.

NRY HG F*-M89: This network (Fig. 34a) was characterised by the presence of the
warrior class, Raju, at the middle of the network. They also showed a population
specific cluster with single step mutations, indicating isolation and YSTR evolution
among them. The other study populations did not show such patterns of YSTR
evolution and probably had different sources of the haplotypes.

NRY HG H1a*-M82: This HG was represented by the majority of the study populations.
The network showed a hypothetical central node with radial branches (Fig. 34b).
Population specific clusters were observed among Relli (SC) and Jalari (fishermen,
SC). Haplotype sharing was observed among one sample each of Mala and Madiga
(both SC); two samples of Raju (warrior) and Relli (SC); and one sample each of Mala
and Yadava. This network provides evidence for gene flow among populations.

Figure 33: Multi Dimensional Scaling of NRY-STR Rst distances for populations of
Andhra Pradesh
Note: The tribal populations are indicated by squares and caste populations are indicated by
circles. The YSTR based proximity of the populations by their mode of subsistence is
indicated by the MDS plot. Tight clusters were formed among the wet/dry land farmers and
tribal populations. The Brahmin groups were diverse. The SC populations were distributed
widely, similar to the PCA.

Figure 34a-34e: Phylogenetic networks for HGs F*-M89, H1a*-M82, L1-M27/76,
O2a-M95, R1a1a-M17

Figure 34a: NRY HG F*-M89
Figure 34b: NRY HG H1a*-M82

NRY HG J2a*-M410: This network showed long branches with multiple unoccupied
steps and no central median node. This HG was over represented in the warrior class,
Raju, with YSTR evolution among them. Brahmin Dravida also showed high
representation, but unlike Raju they did not show isolated YSTR evolution. They
probably had a distant source for these haplotypes.

NRY HG J2b-M221/102: The haplotypes of this network did not show any 
population specific cluster. However it was over represented in SC populations (Mala 
and Madiga) and wet/dry land farmers (Kamma and Settibalija). No haplotype sharing 
was observed among these populations, indicating that there could be different 
sources for this HG. 

NRY HG L1-M27/76: This network was characterised by a hypothetical central median
node, with branches radiating around it (Fig. 34c). The network was mainly
composed of wet/dry land farmers (Kapu and Kamma). Kapu showed shorter branch
lengths and step wise mutations, indicating long term isolation and evolution of their
YSTRs. Kamma were represented by long branches with multiple unoccupied steps,
indicating either a different source for their haplotypes within HG L1 or genetic drift.
Raju formed a population specific cluster with YSTR evolution and minimal sharing of
their haplotypes, indicating evolution of their YSTRs.

NRY HG O2a-M95: Three distinct clusters were identified in the network: the tribal
populations Konda Kammara and Konda Reddy, and the SC population Jalari. There was
no YSTR sharing among these populations, showing evolution in different directions.
All these features indicated that there could be different sources of haplotypes for
these populations (Fig. 34d).

Figure 34c: NRY HG L1-M27/76
Figure 34d: NRY HG O2a-M95

NRY HG R1a1a-M17: The central node of this network was occupied by Kapu, a
wet/dry land farming population (Fig. 34e). Kamma, another wet/dry land farming
population, formed a population specific cluster. Linear branches with multiple steps
of evolution along the network showed YSTR evolution and spread of this HG.

NRY HG R2-M124: The haplotypes of HG R2 were spread across all the study 
populations. No population specific clusters were identified. Majority of the 
populations showed long radial branching with multiple unoccupied steps from a 
hypothetical central node. 


4.1.4.7 Mismatch Distributions 


Mismatch distribution analysis of NRY HG F*-M89 (MPD: 11) showed
bimodal peaks and a high MPD value, indicating that there may be unidentified
haplogroups within this paragroup (Fig. 35a-35j). These populations have at least two
different sources of haplotypes for this paragroup. HG C5-M356 (MPD: 9.5)
showed multimodal peaks, indicating diverse sources or loss of haplotypes by drift.
HGs such as J2a*-M410 (MPD: 10.1), J2b-M221/102 (MPD: 9) and O2a-M95 (MPD:
7.05) also showed multiple peaks and high MPD values. The haplotypes could
have had multiple sources within each haplogroup. HGs H1a*-M82 (MPD: 8.1),
L1-M27/76 (MPD: 7.8) and R1a1a-M17 (MPD: 7.4) showed a unimodal peak, indicating
long term demographic expansion among the populations of Andhra Pradesh within these
NRY HGs.


4.1.4.8 BATWING and ASD analysis 


The populations of Andhra Pradesh showed a coalescence time of ~8.2 Kya.
The total ancestral effective population size was 13,701 (95% CI: 12,956-14,157)
(Table 8), whereas the TMRCA was found to be 61,352 Ybp (95% CI: 59,131-64,775)
and the population expansion time was 32,671 Ybp (95% CI: 26,388-35,167). In the
phylogenetic tree computed by BATWING (Fig. 36) it was observed that the Konda
Reddy tribe showed a distinct line of evolution from 8.2 Kya. The Brahmin
populations of Andhra Pradesh showed a coalescence time of ~5 Kya. Kamma and Jalari
showed a recent split time of 1.7 Kya, whereas the wet/dry land farming populations
Kapu and Yadava showed a relatively deep split time of 3.3 Kya. The SC
populations (Mala and Madiga) also showed a recent split, at 2.4 Kya.

Figure 35a-35j: Mismatch distributions of YSTRs for HGs for study populations

The two Brahmin populations had only one HG (R1a1a-M17) in common.
The ASD was markedly different between them (Brahmin ANV: 12.1±3.4 Kya
and Brahmin Dravida: 3.4±1.4 Kya) (Appendix 11), suggesting that these two
populations had their own unique histories, which was also supported by the gene
frequencies and the PCA and MDS plots (Fig. 32a, 33). HG H1a*-M82 was present in the
majority of the populations of Andhra Pradesh. The ASD estimates for HG H1a*-M82
differed among populations, with three of them showing >20 Kya, four
showing 10-20 Kya and the others showing <10 Kya. This indicated long term
expansions and markedly different sources of this HG in Andhra Pradesh.

Overall, the results suggest that the populations of Andhra Pradesh were
differentiated based on their mode of subsistence, with no gene flow detected for at
least 2000 years.

4.2: NRY HG L1-M27/76 Story and India

The results described in the previous section on the populations from the Deccan
region of India have shown HGs R1a1a-M17, H1a*-M82 and L1-M27/76 to be
the most common HGs. As the present study region is mainly inhabited by Dravidian
speakers, I performed a Pan Indian analysis of HG L1-M27/76, which has been
previously described as a marker for Dravidian speakers of India by Sengupta et al.
(2006), so as to decipher the origin of these speakers. However, in the study by
Sengupta et al. (2006), the representation of samples from various geographic regions
of India was not appreciably large (the L1-M27 sample size was 55). In the present study
I investigated 611 NRY HG L1-M27/76 chromosomes from a total of 5,099 samples
studied under the Genographic project from this laboratory. 212 samples came
directly from my study.
4.2.1: Distribution of NRY HG L1-M27/76 

Table 16a: Frequency distribution of HG L1-M27/76 in India and literature data

The distribution of NRY HG L1-M27/76 present in the Indian populations was
analysed from the 'Genographic' study (present study: 611, studied by others: 28).
The Pan Indian data compiled from the Genographic Indian centre showed an over
representation of samples from Tamil Nadu, where the sample size was almost twice
the total samples studied from other states of India (Table 16a, b). Hence, to reduce the
effect of uneven sample size on data analysis, I randomly reduced the Tamil Nadu
samples from each population to one third. Thus the total N studied from Tamil Nadu was
reduced from 1,356 to 344 and hence a total of 404 L1-M27/76 chromosomes from
4,772 samples were studied from across India. Table 15 shows the L1-M27/76
frequency from the truncated Tamil Nadu data and those of other parts of India. A
predominant presence of HG L1-M27/76 with a frequency of > 0.5 was observed in
Piramalai Kallar (N=24), a frequency of 0.2-0.4 in 20 study populations and 0.1-0.2 in
28 study populations, whereas the 56 populations frequency of 0.9 and above. HG L1
was seen significantly more in caste populations than in tribes (FET 1E-15).
Linguistically, it can be associated with the South Dravidian language (SDR) (Table 17).
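
The random reduction of each Tamil Nadu population to one third can be sketched as below. This is an illustration only: the thesis does not describe the sampling procedure beyond "randomly", so the per-population rounding rule, the fixed seed and the toy sample identifiers are all assumptions.

```python
import random


def subsample_by_population(samples, fraction=1 / 3, seed=0):
    """Keep roughly `fraction` of the samples of each population, at random.

    samples: list of (sample_id, population) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_pop = {}
    for sample_id, pop in samples:
        by_pop.setdefault(pop, []).append(sample_id)

    kept = []
    for pop, ids in by_pop.items():
        k = max(1, round(len(ids) * fraction))  # keep at least one per population
        kept.extend((sid, pop) for sid in rng.sample(ids, k))
    return kept


# Hypothetical data: 9 samples from population "A" and 6 from population "B".
samples = [(f"A{i}", "A") for i in range(9)] + [(f"B{i}", "B") for i in range(6)]
kept = subsample_by_population(samples, seed=42)
print(len(kept))
```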

As a region, the Deccan showed the highest frequency of HG L1 compared to the
northern Indian states. Tamil Nadu (L1-M27/76 frequency: 19.1%, FET 2E-10),
followed by Andhra Pradesh (11.4%, FET 2E-17), Kerala (10.7%, FET 5E-04) and
Karnataka (8.95%, FET 1E-09), all showed appreciable frequencies; select populations
in fact had higher proportions (Table 16a, b). The high frequency of L1-M27/76 in
Tamil Nadu could be attributed to the expansion of this haplogroup in this region or
to recurrent migrations of people carrying this HG. The distribution of L1-M27/76 was
represented in a contour map (Fig. 37a). The highest frequency was identified in the
Tamil Nadu region and the frequency of this HG decreased as one moves north.
4.2.2: L1-M27/76 17Y-STR variance distribution: 

The 17-YSTR variance averaged over all loci is represented in Fig. 37b as
a contour map. The highest YSTR diversity was observed in Tamil Nadu and, as with
the frequency contour plot, the variance decreased as one moves north.
To determine whether the variance observed was a result of multiple events of
admixture/in-migration, the Sum of Squared Differences (SSD) from the median for each
region was calculated, and the frequency of haplotypes observed at each mutational
step is represented in Table 18.
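
As an illustration of the "variance averaged over all loci" used for the contour map, the sketch below computes the sample variance of repeat counts at each locus and averages it across loci. The use of the sample (n-1) variance is an assumption, and the three-locus haplotypes are invented.

```python
from statistics import variance


def mean_locus_variance(haplotypes):
    """Sample variance of repeat counts per locus, averaged over all loci."""
    per_locus = [variance(locus) for locus in zip(*haplotypes)]
    return sum(per_locus) / len(per_locus)


# Hypothetical haplotypes from one region (three loci only).
region_haps = [
    (12, 16, 22),
    (12, 16, 22),
    (12, 17, 22),
    (13, 16, 23),
    (12, 16, 24),
]
print(round(mean_locus_variance(region_haps), 3))
```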

The sum of squared differences (SSD) estimate based on YSTRs allows one to
show the distances of haplotypes from the median haplotype. The median haplotype
obtained from the study samples was considered as the founding haplotype (Sengupta
et al., 2006). For each geographical region, the number of samples present at each
distance from the median haplotype was counted and expressed as a frequency
(Table 18). Tamil Nadu, Andhra Pradesh and Karnataka each had one sample in the
median. Karnataka populations showed an expansion of haplotypes with a distance
range of 0-16, while Tamil Nadu samples showed a haplotype spread of 0-30 steps,
the maximum identified in the study. The other regions of the Deccan, Maharashtra and
Kerala, showed discontinuous distributions from 1-10 and 1-7 respectively. In contrast
to the Deccan samples, the North India and Bihar regions showed a discontinuous and
sporadic distribution at various distances, without much continuity. Further,
haplotypes closer to the median were not found among them.

Figure 37a: Contour map of NRY HG L1-M27/76 based on its frequency
Figure 37b: Contour map of NRY HG L1-M27/76 based on YSTR variance
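
The SSD tabulation described above can be sketched as follows, assuming the median haplotype is taken locus by locus and that the distance of a sample is the sum over loci of squared repeat differences from that median. Whether Table 18 uses squared or absolute steps is not stated, so this is an assumption, and the haplotypes are invented.

```python
from collections import Counter
from statistics import median_low


def median_haplotype(haplotypes):
    """Per-locus median repeat count (low median, so it is an observed allele)."""
    return tuple(median_low(locus) for locus in zip(*haplotypes))


def ssd_from_median(haplotype, median):
    """Sum of squared repeat differences between one haplotype and the median."""
    return sum((a - m) ** 2 for a, m in zip(haplotype, median))


def ssd_frequencies(haplotypes):
    """Fraction of samples found at each SSD distance from the median."""
    med = median_haplotype(haplotypes)
    counts = Counter(ssd_from_median(h, med) for h in haplotypes)
    return {d: counts[d] / len(haplotypes) for d in sorted(counts)}


# Hypothetical regional sample (three loci only).
haps_region = [
    (12, 16, 22),
    (12, 16, 22),
    (12, 17, 22),
    (13, 16, 23),
    (12, 16, 24),
]
print(median_haplotype(haps_region))
print(ssd_frequencies(haps_region))
```

A region whose samples occupy every distance from 0 upwards would appear as a continuous spread, whereas gaps in the returned distances correspond to the "discontinuous" pattern described for North India and Bihar.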
4.2.3: L1-M27/76 Haplotype network analysis 

Several reduced median networks were computed to determine the relationship
of L1-M27/76 haplotypes with each other. All the networks were computed using the
reduced median algorithm with the reduction threshold set to 1. The first network was
computed using all the 611 samples from the Genographic India study (Fig. 38). In
this network Tamil Nadu samples were strewn all over the network; nonetheless the
centroid/median haplotype was formed by three samples from three distinct regions
(Tamil Nadu, Karnataka and Andhra Pradesh). As explained previously, the Tamil
Nadu samples were highly over represented. Hence the second network was computed
with the reduced data (N=404) (Fig. 39). The picture did not change much from the
first network, but the north and west Indian L1-M27/76 lineages were well
differentiated in this network, suggesting a unique line of evolution. The 17 STR
networks, with or without the truncated samples, showed a central median node
comprising samples from Andhra Pradesh, Karnataka and Tamil Nadu (median
haplotype 12,16,22,15,14,15,15,10,19,12,10,14,11,12,24,12,11 for YSTRs D389a,
D389b, D390, D456, D19, D458, D437, D438, D448, DH4, D391, D392, D393,
D439, D635, D388, D426). The median haplotype was constituted by Kapu (Andhra
Pradesh), Kuruba (Karnataka) and Ezhava (Tamil Nadu) populations (Table 19). The
other haplotypes radiate around this node in different tiers. The pattern was very clear
in the Tamil Nadu truncated network (Fig. 39). The presence of L1 samples from the
Deccan in all radiating branches of the RM network makes the Deccan the candidate
region of origin of L1.

To determine the origin and expansion of this haplogroup, the Indian
haplotypes were compared with those from the literature, so that L1-M27/76
samples from regions not sampled in the present study were included. A set of 9
common STRs for which haplotype data were available in the literature was compared
with the Indian dataset by network analysis (Fig. 40). In the RM network, the
Afghanistan, Pakistan and Syria populations shared L1 haplotypes at the second
mutational step from the median. 90% of the median was composed of samples from the
Deccan, 75% of them being from Tamil Nadu, Karnataka and Andhra Pradesh. Further,
the IE speaking north Indian populations showed a discrete radiation from the median,
with specific and well defined divergence on its radius. It is to be noted that the 9
YSTRs compared with literature data once again showed only Indian, and specifically
Deccan, samples (apart from 2 Assam samples) as the median haplotype.

To determine whether the reduction of STRs (17 to 9) in the previous network created
any bias in the analysis, another network with the remaining 8 STRs of the Indian
samples was computed (Fig. 41). In this network, the extent of sharing within the
study states varied when a different set of YSTRs was considered
(i.e., D456, D458, D437, D438, D448, DH4, D635, D426). Importantly, the samples within
the median changed, although the predominance of the Deccan region was still observed.
Similarly, the North and West Indian populations formed a separate cluster. Hence
with all the 17 YSTRs the stringency was high, eliminating the Maharashtra, Gujarat
and Assam populations from the median. Thus the choice of YSTRs and their
application in deciphering the origin of haplogroups have to be dealt with caution, and
the YSTRs employed in this Genographic study were more informative than the 9
STRs used in earlier studies.


4.2.4: MultiDimensional Scaling 


An MDS plot was computed based on YSTR Rst distances for the common 9
STRs of HG L1-M27/76 chromosomes obtained in the present study along with those
available in the literature from Afghanistan, Pakistan and Syria (Fig. 42).
Populations with a sample size of at least 3 were considered for the computation. A
stress value of 15.71 for three dimensions (k=3) was obtained. Pakistan and Syrian
populations clustered along with the populations of the Deccan, India. The populations
of Tamil Nadu were widely distributed in the middle of the MDS plot, indicating diverse
STR evolution, whereas the populations of Karnataka formed a relatively tight cluster.
Most of the Brahmin populations were outliers. The high variance observed in the
south Indian populations argues for a long term evolution of L1 STRs in them.
Unfortunately, 17 STRs were not available in the published literature to further refine
the affinity of the Indus valley and Syrian populations. The genetic distances of the
populations studied were further depicted in an NJ tree that was computed based on Rst
distances.


4.2.5: STR based age estimates 


Tables 20 and 21 present the YSTR variance, ASD based age, effective
population sizes and population expansion time estimates of the study populations and
literature data. The total HG L1 YSTR variance for 9 YSTRs was 0.35 and the ASD was
14,127±4,066 Ybp. High ASD estimates were observed in SDR speakers of Tamil
Nadu, followed by SDR speakers of Karnataka and Andhra Pradesh.

Syrian populations showed an ASD matching that of Karnataka populations
(12,400 Ybp), but they exhibited low YSTR variance. Similarly, South Pakistan and
Andhra Pradesh populations also showed ASD estimates of ~10,000 years, but again
the YSTR variance of Andhra Pradesh was 1.6 times higher than that of Afghanistan
populations. Higher sample sizes from these regions are warranted. Among the Indian
study populations, IE speakers of Uttar Pradesh showed higher variance (0.60) and
an ASD of 33,000±18,300 Ybp (9 YSTRs). However, as indicated by the SSD estimates
described earlier, the high variance in these regions can be attributed to multiple
sources of HG L1 among these populations.

Figure 42: Multi Dimensional Scaling of HG L1 based on Rst distances of global
populations (9 YSTR)
Note: Populations with N>3 were used. Populations outside India are indicated as squares
and diamonds. Indian populations are indicated by circles. The populations represented by
codes are listed in Appendix 16. It is to be noted that the majority of the Brahmin
populations fall towards the periphery of the cluster. Tamil Nadu populations were
distributed more widely than Karnataka populations.

When 17 YSTRs were considered, the overall HG L1 variance for Indian
populations (0.4) was found to be equal to the variance of SDR speakers of Tamil
Nadu (0.41) and Andhra Pradesh (0.4), followed by Karnataka (0.36). The ASD of Tamil
Nadu was the highest (20,000 ± 4,700 Ybp). However, Kerala showed
higher variance than the pooled variance of HG L1 itself. This could be the result of
gene inflow into the populations of Kerala from various sources. The total variance
and ASD estimate for the 17 STR data set were 0.4 and 14,382 ± 3,088 Ybp, whereas
those for the 9 STR data set were 0.35 and 14,127 ± 4,066 Ybp, though the latter
set included data from outside India as well. The high variance and ASD estimates in
Tamil Nadu again suggest an origin of L1-M27/76 here.
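The variance and ASD figures quoted above can be illustrated with a minimal sketch. It assumes that ASD is the squared difference in repeat counts from a founder haplotype, averaged over loci and chromosomes, and that the age follows from ASD divided by the mutation rate under a stepwise mutation model. The mutation rate, the generation time, the use of the per-locus median as founder and the sample haplotypes are all illustrative assumptions; none of them is taken from this section.

```python
from statistics import median, pvariance

# Illustrative assumptions only: neither value is stated in this section.
MU = 6.9e-4        # assumed mutation rate per locus per generation
GEN_YEARS = 25     # assumed generation time in years


def median_haplotype(haplotypes):
    """Per-locus median repeat count, taken here as the assumed founder."""
    return [median(locus) for locus in zip(*haplotypes)]


def ystr_variance(haplotypes):
    """Population variance of repeat counts, averaged over loci."""
    per_locus = [pvariance(locus) for locus in zip(*haplotypes)]
    return sum(per_locus) / len(per_locus)


def asd(haplotypes, founder):
    """Average squared difference from the founder, averaged over loci."""
    total = sum((a - f) ** 2 for h in haplotypes for a, f in zip(h, founder))
    return total / (len(haplotypes) * len(founder))


def asd_age(haplotypes, founder, mu=MU, gen_years=GEN_YEARS):
    """Convert ASD to years: generations = ASD / mu."""
    return asd(haplotypes, founder) / mu * gen_years


# Hypothetical three-locus haplotypes (repeat counts), not study data.
sample = [(13, 16, 22), (13, 16, 22), (14, 16, 22), (13, 17, 23)]
founder = median_haplotype(sample)
print(ystr_variance(sample), asd(sample, founder), round(asd_age(sample, founder)))
```

Because the age scales inversely with the assumed mutation rate, two regions can be ranked by ASD age without fixing the rate, but absolute ages depend on it directly.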


4.2.6: BATWING estimates of population parameters and phylogenetic tree 


BATWING age estimates showed a higher TMRCA and population expansion
time in Tamil Nadu (52,200 Ybp, 95% CI: 29,300-103,475, and 17,325 Ybp, 95%
CI: 10,125-30,825, respectively), which is consistent with the previous analyses. Indian
populations showed a higher effective population size of 1,760 (95% CI: 712-4,151),
higher population expansion times of 10,012 Ybp (95% CI: 5,664-20,240) and
higher YSTR variances than populations of Syria, Pakistan and Afghanistan, thus
ruling out the possibility of an origin of L1-M27/76 outside India (Table 22). Deccan
populations showed higher TMRCA, effective population size and population
expansion time than the North Indian populations (yellow cluster in network Fig 39),
thus confirming south India as the place of origin of HG L1-M27/76. The present
study has suggested Tamil Nadu or Karnataka as the most probable region of origin of
HG L1, with concomitant expansion in Andhra Pradesh. The exact region of origin within
South India could not be deciphered with the present set of markers. More L1 subtype
SNPs and more informative STRs may be required. It will be worthwhile to
investigate the same cohort further with these new markers in future.

Figure 43: NJ tree based on HG L1-M27/76 Rst distances for all Indian
and literature data (9 YSTRs)

4.3 Genetic footprints of HG L3*-M357 - An Enigma


4.3.1 Distribution of NRY HG L3*-M357 


NRY HG L3* is identified by the M357 SNP marker (a subtype of HG L-M20).
Though certain studies have identified the presence of this HG in India, Pakistan and
Afghanistan, its pattern of migration was not characterised. Hence in this chapter, I
have studied 210 Y chromosomes from India and 45 samples available from literature
to address this question. In India, HG L3*-M357 was present at higher frequencies of
>10% in the Northern Indian states of Jammu (19.8%, FET 6.E-23) and Punjab (14%,
FET 3.E-06). The other states showed <6% frequency of HG L3*-M357 (Table 23). In
contrast to HG L1-M27/76, the frequency of HG L3*-M357 decreases towards the south.
These frequencies are also reflected in the contour maps based on HG L3*-M357
frequency and YSTR variance (Fig 44a, 44b). However, higher YSTR variance was
observed in Himachal Pradesh (0.38), Rajasthan (0.31), Jammu (0.23) and Kerala
(0.35). HG L3*-M357 was present at a frequency of 4.44% (FET 8.E-20) among IE
speakers and 1.79% (FET 7.E-03) among South Dravidian speakers (Table 24).
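The FET values reported with each frequency are Fisher's exact test p-values. How the contingency tables were constructed is not described in this section, so the sketch below assumes a 2x2 table of carriers and non-carriers of the haplogroup in one group versus all other groups, and computes the two-sided p-value from the hypergeometric distribution. The counts used are hypothetical.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    a, b: carriers and non-carriers in the group of interest
    c, d: carriers and non-carriers in all other groups (assumed layout)
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    observed = log_p(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(exp(log_p(x)) for x in range(lo, hi + 1)
                if log_p(x) <= observed + 1e-9)
    return min(1.0, total)


# Hypothetical counts, not study data: 8 of 10 carriers versus 1 of 6.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"{p_value:.4f}")
```

Working in log space keeps the calculation stable for sample sizes in the thousands, such as those listed in Table 24.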


Table 24: Frequencies of NRY HG L3*-M357 in various language speakers of India

Language | N studied | L3* (%) | p-value
AA       | 570       | 0.00    | 8.E-07
CDR      | 461       | 0.00    | 1.E-05
SDR      | 2793      | 1.79    | 7.E-03
IE       | 3246      | 4.44    | 8.E-20
TB       | 1312      | 1.14    | 4.E-04

Note: HG L3*-M357 is represented in high frequency in IE speakers.
TB: Tibeto-Burman languages
IE: Indo-European languages
AA: Austro-Asiatic languages
SDR: South Dravidian languages
CDR: Central Dravidian languages

Table 23: Frequency distribution of NRY HG L3*-M357 in India and literature data

Note:
TB: Tibeto-Burman languages
IE: Indo-European languages
AA: Austro-Asiatic languages
SDR: South Dravidian languages
p values >0.05 were removed


Figure 44a: Contour map showing the distribution of HG L3*-M357 in India based on its
frequency

Note: Populations with a sample size of at least 5 were used for constructing the contour
plots. High L3*-M357 frequency and variance were observed in Jammu, Himachal Pradesh,
Punjab and Rajasthan in north India, and in Kerala and Karnataka in south India.


To estimate the relative distances of each study region from the median
haplotype (the most probable ancient haplotype) using 17 YSTRs, SSD was
calculated (Table 25a). Andhra Pradesh (N=5 only) was represented at the minimum
distance, but its distribution was sporadic over the spectrum. Tamil Nadu had
haplotypes distributed continuously over SSD distances 1-10. Karnataka showed
continuous SSD distances from 1-3 and again from 5-9; at higher SSDs the distribution
was sporadic. In contrast, north Indian populations did not show the presence of the
median haplotype. Overall, the results showed lower levels of evolution.
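The idea of a distance spectrum around the median haplotype can be sketched as follows. SSD is not defined in this section, so the sketch assumes it is the sum of absolute repeat-count differences from the median haplotype; a squared-difference definition would only change `step_distance`. The function names and the haplotypes are hypothetical.

```python
from collections import Counter
from statistics import median_low


def median_haplotype(haplotypes):
    """Per-locus median repeat count (lower median, so it stays an integer)."""
    return tuple(median_low(locus) for locus in zip(*haplotypes))


def step_distance(haplotype, reference):
    """Assumed SSD: total number of repeat steps separating two haplotypes."""
    return sum(abs(a - r) for a, r in zip(haplotype, reference))


def distance_spectrum(haplotypes):
    """Count how many chromosomes sit at each distance from the median."""
    reference = median_haplotype(haplotypes)
    return Counter(step_distance(h, reference) for h in haplotypes)


def is_continuous(spectrum, low, high):
    """True if every distance from low to high is occupied by a haplotype."""
    return all(spectrum.get(d, 0) > 0 for d in range(low, high + 1))


# Hypothetical three-locus haplotypes for one region, not study data.
region = [(13, 16, 22), (13, 16, 22), (14, 16, 22), (13, 17, 24)]
spectrum = distance_spectrum(region)
print(sorted(spectrum.items()), is_continuous(spectrum, 0, 1))
```

A region whose spectrum has a non-zero count at distance 0 carries the median haplotype itself, and gaps in the spectrum correspond to the sporadic distribution described above.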

The phylogenetic relationship among the study regions was deciphered in the
network analysis (Fig 45). The network showed two branches, evolving in opposite
directions from a hypothetical central node. Geography specific clusters with minimal
gene flow between them were observed. Jammu, Himachal Pradesh and
Punjab clustered together (cluster 1 in Fig 45). All the populations in this cluster were
IE speaking populations. Cluster 2, in contrast, comprised mainly samples from the
Deccan, with an over-representation from Karnataka.

4.3.2 Comparison of HG L3*-M357 with the global populations 

SSD values were calculated based on 9 YSTRs for the global populations with
the data available from the literature (Table 25b). A higher proportion of samples from
the Deccan, Afghanistan and Pakistan regions indicated the presence of the
ancient median haplotype. This was further reflected in the network analysis. Among
North Indian populations, only Punjab showed a continuous spectrum of SSD distances
(1-6), whereas the others were distributed sporadically.

The network analysis including literature data with 9 YSTRs showed the
central node comprising populations from Himachal Pradesh, Punjab and Rajasthan.
Surprisingly, no haplotypes from these study regions were shared with their
geographical neighbours, Pakistan and Afghanistan (Fig 46, Table 26). The
populations of Karnataka shared haplotypes with populations of Afghanistan,
Pakistan and East Caucasus. Jammu formed a distinct cluster with over-representation
from a pastoral unit of the Himalayas, the Brokpa. The median haplotype was
shared by Punjab, Himachal Pradesh and Rajasthan populations. The composition of
this median haplotype is presented in Table 26. The median haplotype for
YSTRs D389a, D389b, D390, D19, D391, D392, D393, D439 and D388 in cluster
1 (north India) was 13-16-22-15-10-14-12-12-11 and that of cluster 2 (Deccan,
Pakistan, Afghanistan) was 13-16-22-15-10-14-12-12-12. The north Indian and south
Indian clusters differed at the D388 locus by a single-step mutation.
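The single-step difference between the two median haplotypes can be checked directly from the repeat counts quoted above. The sketch below is only a convenience check; the helper names are hypothetical.

```python
# Locus order as listed in the text.
LOCI = ["D389a", "D389b", "D390", "D19", "D391", "D392", "D393", "D439", "D388"]


def parse_haplotype(text):
    """Turn a dash-separated haplotype string into a list of repeat counts."""
    return [int(x) for x in text.split("-")]


def locus_differences(first, second, loci):
    """Map each differing locus to the signed repeat difference (second - first)."""
    return {name: b - a for name, a, b in zip(loci, first, second) if a != b}


cluster1 = parse_haplotype("13-16-22-15-10-14-12-12-11")  # north India
cluster2 = parse_haplotype("13-16-22-15-10-14-12-12-12")  # Deccan, Pakistan, Afghanistan
diffs = locus_differences(cluster1, cluster2, LOCI)
print(diffs)  # {'D388': 1}
```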

The Rst genetic distances of the study regions along with data from literature were
displayed in an MDS plot (Fig 47). Most of the north Indian populations, i.e., Punjab,
Rajasthan, Himachal Pradesh and Jammu, clustered together and the Deccan
populations formed a distinct cluster. Brahmin Havyaka, an IE speaker, Chechen
(East Caucasus) populations and Rajasthan populations also clustered with the north
Indian group. Gowda of Karnataka was close to Pashtun of Afghanistan.

The pooled ASD estimate of HG L3*-M357 in Indian populations was 13,376
± 3,181 for 17 YSTR loci. In India, higher variance and ASD were found in Himachal
Pradesh populations (14,919 ± 4,740). Taking into account the study regions and
other neighbouring regions (9 YSTRs), Afghanistan showed the highest ASD age (15,200
± 4,400). Himachal Pradesh showed a variance of 0.21 and an ASD of 8,891 ±
3,880. The ASD ages in the South Indian study states were lower. But Kerala
showed higher variance (0.42) and ASD (18,100 ± 7,000) than the
pooled YSTR variance and ASD of the haplogroup itself. This indicates the
possibility of gene flow into this region. This estimate has to be treated
cautiously, as the sample size from Kerala was only 7. A larger sample from Kerala
may throw further light on this issue.

Table 26: Composition of various populations represented in network Figure 46

Figure 47: Rst based MDS plot for HG L3* for Indian along with global
populations for 9 YSTRs

Note: Populations with a sample size of at least 3 were considered for this
analysis. The population codes are given in Appendix 12.

The BATWING based phylogenetic tree (Fig 48) showed affinity of Pakistan and
Afghanistan populations to South Indian populations. Jammu, Himachal Pradesh,
Punjab and Rajasthan formed a distinct, separate cluster with a coalescence time of
18,789 years. The TMRCA of all the studied state samples was 21,796 (21,251-23,010) and
the population expansion time was 19,914 (19,245-20,296) (Table 27). The evidence
from the network, BATWING and other analyses thus indicates an origin for HG L3*-M357
~18,700 Ybp; soon after its formation, two distinct migrations took place from
Afghanistan, one to Deccan India, presumably by the coastal route, and the other
towards the north-western and northern frontiers of India. The most probable route was
from East Caucasus 9,421 Ybp (Fig 48) via Pakistan and Karnataka, branching from there
to Kerala and Tamil Nadu on one hand (5,000 Ybp) and to Afghanistan and Andhra Pradesh
on the other.

4.4 Origin and dispersal of NRY HG J2a*-M410

The spread of HG J has been associated with agriculture. Hence, to decipher
its distribution pattern in India and its association with agriculture, I have studied HG
J2a* and HG J2b, which include 329 HG J chromosomes from the Genographic study
populations out of the total 5,033 samples studied. I have included 231 YSTR data from 6
regions from literature for comparative analysis and to derive a holistic picture. The
frequencies of the HG used from literature, along with their references, are mentioned in
the review of literature section.

4.4.1 Phylogeography of HG J2a*-M410:

The distribution of NRY HG J2a*-M410 among Indian populations is presented in
Table 28. HG J2a* was distributed mostly in caste populations (5.8%), whereas in
tribes it was found at 1.7%. Linguistically, it was associated with IE speakers (6.2%,
FET 7.E-18) in India (Table 29). Fig 49a represents the contour plot based on YHG
frequencies of J2a* in various study states. Gujarat populations such as Maldhari (58%)
and Patel (26.8%) showed high frequencies. A frequency of >20% was observed in four
populations of Himachal Pradesh. The frequencies decrease towards the northeast and the
southern Deccan. The STR variance based contour plot (Fig 49b) showed two hot spots of
high average STR variance, at Uttar Pradesh and Maharashtra. The high frequency (33.7%)
and YSTR variance (0.70) in Maharashtra were attributable mainly to the Parsee population.


4.4.2: Phylogenetic network analysis: 


A phylogenetic reduced median network of 17 STRs of HG J2a*-M410 at the pan-India
level (Fig 50a, 50b) revealed a median connecting link with reticulations but no
samples. The populations were represented in three consecutive tiers around this
median. The innermost tier contained south Indian populations; the middle tier and the
last tier were represented by west Indian and north Indian populations showing
multistep mutations and concerted evolution within the population.

Figure 49a: Contour map showing the distribution of HG J2a* in India based on its
frequency

Figure 49b: Contour map based on YSTR variance of HG J2a* in India

Figure 50: Reduced median phylogenetic network for HG J2a*-M410 for Indian
populations based on 17 YSTRs

4.4.3 Comparison with Global populations:

The absence of samples in the median with 17 STRs revealed the absence of a
founder haplotype in India. To decipher this further and to find the relationships of the
Indian haplotypes with the global populations, three approaches were used:
phylogenetic network analysis, multidimensional scaling, and ASD and BATWING based age
estimation, all using 8 YSTR loci. In the RM network (Fig 51a, 51b), there were two
interconnecting medians, one formed by 11 samples from Lebanon and one each from
Palestine, Europe, Uttar Pradesh and Karnataka. A huge expansion of this node led to
samples from Himachal Pradesh, Europe and Maharashtra. Its haplotype was
13-16-23-14-10-11-12-15 for loci D389a, D389b, D390, D19, D391, D392, D393 and D388
respectively. The other median comprised mainly the Mauria population from
Bihar and branched out to many Middle Eastern, European and Indian populations.
Haplotype sharing was observed between populations of Afghanistan and Indian
populations, especially from west and north India, with stepwise mutations.

The MDS plot computed for these samples was very striking (stress value:
15.29). At the resolution of 8 YSTRs, Jammu, Himachal Pradesh, Rajasthan and Uttar
Pradesh populations were clearly separated from the rest of the populations studied in
all three dimensions, but most conspicuously in Dimension 1 (Fig 52). The other
populations, strewn on the left, also showed some clustering. Populations from
Lebanon, east Asia and Europe lay in the middle of the other populations from the
Deccan. A similar picture was also obtained in the NJ tree computed using MEGA
(Fig 53).

Figure 51: Reduced median phylogenetic network for HG J2a*-M410 based on 8 YSTRs

Figure 52: Rst based MDS plot for HG J2a* for Indian and global populations for 8
YSTRs

Note: Circles indicate Indian populations. Squares, diamonds and triangles represent global
populations. The population codes are given in Appendix 12.
North Indian populations showed distinct YSTR signatures, whereas the
South Indian populations clustered along with global populations.

Figure 53: NJ tree based on YSTR Rst distances with global populations with 8 YSTRs

4.4.4: ASD and BATWING based age estimates:


The YSTR based variance and ASD age estimates for 8 YSTRs and 17 YSTRs
are listed in Table 30. The YSTR based variance (both 8 and 17 STRs) showed a
gradual decrease towards east and northeast India. When 8 YSTRs were considered
within India, Andhra Pradesh, Maharashtra and Karnataka showed higher variance
(0.62, 0.48 and 0.52 respectively) and ASD estimates (27,390 ± 6 Ybp, 20,758 ±
4,683 Ybp and 21,874 ± 3,097 Ybp respectively). Lebanon also showed high variance
(0.53), similar to that of Karnataka, with an ASD of 20,591 ± 5,490 Ybp. North Indian
study states showed lower variance and ASD estimates. When 17 YSTRs were
considered, Karnataka and Maharashtra consistently showed higher variance (0.5) and ASD
(~22.5 Kya). Jammu and Afghanistan showed similar variance and ASD
(0.6, ~25 Kya). Kerala and Bihar also showed higher variances and ASD. To test whether
the high variance with 8 YSTRs in these study states was due to gene inflow, mismatch
distributions were calculated and are represented in Fig 54 with their MPD values. The
mismatch distributions of Himachal Pradesh, Maharashtra, Karnataka and Punjab
showed a unimodal peak, with MPD values of 4.0, 4.9, 5.2 and 2.4 respectively,
indicating recent demographic expansion in these regions. The mismatch distribution
of Lebanon also showed a unimodal peak, with an MPD of 4.9. Other regions showed
relatively multimodal peaks, indicating the possibility of gene inflow from various
sources into these populations.
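A mismatch distribution and its MPD can be sketched as below. The sketch assumes that the pairwise difference between two YSTR haplotypes is the sum of absolute repeat-count differences across loci; counting the number of differing loci instead would be an equally plausible convention, and this section does not say which was used. The haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_difference(first, second):
    """Assumed metric: total repeat-step difference between two haplotypes."""
    return sum(abs(a - b) for a, b in zip(first, second))


def mismatch_distribution(haplotypes):
    """Relative frequency of each pairwise difference over all pairs."""
    counts = Counter(pairwise_difference(a, b)
                     for a, b in combinations(haplotypes, 2))
    n_pairs = sum(counts.values())
    return {d: counts.get(d, 0) / n_pairs for d in range(max(counts) + 1)}


def mean_pairwise_difference(haplotypes):
    """MPD: mean of the pairwise differences over all pairs."""
    diffs = [pairwise_difference(a, b) for a, b in combinations(haplotypes, 2)]
    return sum(diffs) / len(diffs)


# Hypothetical two-locus haplotypes, not study data.
sample = [(13, 16), (13, 16), (14, 16), (13, 18)]
distribution = mismatch_distribution(sample)
mpd = mean_pairwise_difference(sample)
print(distribution, mpd)
```

Plotting the relative frequencies against the difference class gives the curves whose single or multiple peaks are interpreted in the text.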

The ancestral effective population size for NRY HG J2a*-M410 for all Indian
study populations was calculated to be 37,780 (95% CI: 25,698-56,378) and the TMRCA
40,233 (95% CI: 27,338-61,408) at 17 YSTR resolution. The ancestral
effective population size (Na), TMRCA and population expansion times based on 8 YSTR
loci (Table 31) were all the highest in Lebanon, suggesting Lebanon as the most
probable homeland of HG J2a*-M410.

Table 30: HG J2a*-M410 YSTR variance and ASD +/- SE in different regions

[Table body illegible in the source; columns: region, N, and Var, ASD and SE for 8 STRs and for 17 STRs]

Note:
Var: Variance
ASD: Average Squared Difference
SE: Standard Error
Only 8 YSTRs were analysed for data obtained from literature for the purpose of comparison with the present
study.
17 YSTRs were used only from India study populations
Regions with sample size of 5 and above were considered for the analysis

Figure 54: Mismatch distributions based on 8 YSTR loci of HG J2a*-M410 among all
study regions

Only a few clades of J2a were seen: HG J2a4a-M47 and HG J2a4c-M68,
subtypes of HG J2a*-M410, were identified sporadically. It was interesting to note
that these HGs were mainly present towards Deccan India. HG J2a4a was specific to
Parsee populations of Maharashtra (9.3%, FET 5.E-11). The average variance of these
haplotypes was 0.1, with an ASD age of 5,062 ± 2,743 Ybp. NRY HG J2a4c-M68 was
present at high frequency among a Nilgiri hill tribe of Tamil Nadu, the Thoda (50%,
FET 6.E-14), and at a lower frequency in an agriculture based tribe of Karnataka, the
Yerava (9%, FET 6.E-06). Yerava showed an average YSTR variance of 0.48 with an ASD
based age of 20,887 ± 5,900 Ybp. Thoda showed a YSTR variance of 0.10 and an ASD of
4,795 ± 1,841 Ybp. This fits well with the oral history of migrations of the Thodas
(Arunkumar et al., 2012). Presumably they were isolates of these migrations from the
Middle East, arriving early, before the local evolution and caste formation.

4.5 Studies on NRY HG J2b-M221/M102

In the present study, 266 chromosomes of HG J2b-M221/102 were studied out of
the total 4,394 samples. This study includes 57 YSTR data obtained from literature
for comparison with the study populations. The frequencies of the
populations studied from literature, along with their references, are mentioned in the
review of literature section.


4.5.1 Phylogeography of HG J2b-M221/102 in India: 


Higher and appreciable frequencies of HG J2b (~20%) were seen in the majority
of the populations of Himachal Pradesh, Punjab and Rajasthan (Table 32).
Nonetheless, almost all the hunter gatherer populations of Tamil Nadu, many of them
hill tribes identified in Sangam literature, showed appreciable (~10%) frequencies of
this HG. Overall this clade is seen at very low frequencies throughout India (Fig 55a,
55b), indicating either its pervasiveness or its remnant nature. The contour plot based
on the 17 YSTR variance revealed high YSTR variances among the populations of Jammu and
the tip of southern India. Linguistically, this HG did not show affinity with any of the
language families.


4.5.2: Phylogenetic network analysis: 


The reduced median network of HG J2b-M221/102 (Fig 56) showed a
hypothetical median/central node with clear branches radiating in all directions.
The 3 to 5 tiers in the network contained many samples, most from the Deccan and Gujarat.
Select populations from the northern states of India, particularly Himachal Pradesh,
Rajasthan and Punjab, i.e., the IE speaking belt, were all seen evolving mostly at the
12 o'clock, 3 o'clock and 10 o'clock positions in the terminal and peripheral outer layers
of the network.

Table 32: NRY HG J2b-M221/102 frequency distribution in India

Figure 55a: Contour map showing the distribution of HG J2b in India based on its
frequency

Note: HG J2b-M221/102 was seen at low frequencies in India. YSTR variance was high in
Himachal Pradesh and the southern tip of India.

Figure 56: Reduced median phylogenetic network for HG J2b in Indian study
populations (17 YSTRs)

Note: Each circle denotes a haplotype. The size of the circle is proportional to the haplotype
frequency. South Indian populations are represented in the inner circle all around the central hypothetical node.
North Indian populations are represented mainly at the 12 o'clock, 3 o'clock and 9 o'clock positions.


4.5.3 Comparison of HG J2b with global populations 


Analysis of the 8 STR haplotypes of J2b from the present study along with those
available in literature revealed some interesting observations. The network in Fig 57
showed a median node. The haplotype of the median node was 12-16-24-15-10-11-12-15
for YSTRs D389a, D389b, D390, D19, D391, D392, D393 and D388. The median
haplotype was shared by 5 samples from Europe (Ashkenazi), 10 from Tamil Nadu, 8 from
Rajasthan, 5 from Greece, 3 from Andhra Pradesh and Punjab, 2 from Assam and Karnataka,
and 1 each from Uttar Pradesh, Cyprus, Arunachal Pradesh, Gujarat, Maharashtra, Lebanon,
Himachal Pradesh, Palestine and West Eurasia. Further, there were many radiating and
expanding branches at various tiers of evolution, with most of the nodes consisting of
samples from both India and the other regions studied. This haplotype sharing suggests
recent common ancestors and an extensive geographical spread of this HG in India and
nearby regions.

A two dimensional MDS plot of the data gave a stress value of 20.27, and hence
the number of dimensions was increased to 3, which reduced the stress value to 13.29.
However, the picture was not as impressive as that obtained with J2a (Fig 58). The
northern Indian samples, particularly those from Himachal and nearby regions, and the
Middle Eastern and European populations appear in the lower half, discriminated by
Dimension 2. The NJ tree computed based on the Rst distances (Fig 59) shows a closer
affinity of north Indian populations to global populations than to the Deccan region.

4.5.4 ASD and BATWING based age estimates: 

In the variance calculations, Palestine populations showed higher variance (0.28)
and an ASD of 14,153 ± 6,430 Ybp (Table 33). Among Indian study states, Karnataka,
Tamil Nadu, Maharashtra and Himachal Pradesh showed high STR based variance
with both 8 and 17 YSTRs. Both Himachal Pradesh and Tamil Nadu showed an
ASD of ~20 Kya with 17 YSTRs. But only Rajasthan and Tamil Nadu populations
showed a unimodal peak in their mismatch distributions, with MPDs of 2.7 and 3.6,
suggesting recent expansion in these regions (Fig 60). However, the YSTR variance
(0.21) and ASD (8,907 ± 1,460) were lower in Rajasthan than in Tamil Nadu. Palestine did
not show a smooth unimodal peak, so its higher variance could also be attributed to gene
flow.

Figure 57: Reduced median phylogenetic network for HG J2b for Indian and global
populations (8 YSTRs)

Legend: Spain, East Caucasus, Lebanon, Middle East, India, Turkey, East Asia, West Eurasia,
West Caucasus, Greece, North Africa, Europe. Populations from different regions of
India are coloured.

Figure 58: Rst based MDS plot for HG J2b for Indian and global populations.
Population codes are given in Appendix 16.

Figure 59: NJ tree based on YSTR Rst distances of HG J2b-M221/102 with Indian
and global populations for 8 YSTRs

Table 33: NRY HG J2b-M221/102 ASD based age estimates in different geographical regions

Var: Variance
ASD: Average Squared Difference
SE: Standard Error
Only 8 YSTRs were analysed for data obtained from literature for the purpose of comparison with the present
study.
17 YSTRs were used only from India study populations

In the BATWING analysis of 17 YSTRs for all Indian HG J2b populations, the
effective ancestral population size was calculated to be 4,358 (95% CI 28,575-6,758)
and the TMRCA 30,952 (95% CI 20,598-47,345). Five sets of BATWING computations
were further carried out for each region. Lebanon showed the highest ancestral
effective population size and TMRCA (Table 34). But the population expansion time
was higher in India (15,431; 95% CI: 8,158-30,912). Thus the overall results indicate
a wide geographical spread of this HG and a longer period of expansion in India.


Figure 60: Mismatch distribution based on 8 YSTR loci of HG J2b-M221/102 in all study states

Figure 60 (continued): panels (y-axis: relative frequency) and their MPD values:
Maharashtra, 4.244; Karnataka, 4.698; Tamil Nadu, 3.651; Andhra Pradesh, 3.574;
North Orissa, 3.905; Uttar Pradesh, 3.667; Assam, 3.600.

Table 34: Effective population sizes, TMRCA and population expansion
times based on BATWING analysis for HG J2b-M221/102 (8 YSTRs)

Region | Effective population size (Na) | TMRCA | Population expansion time

373 (114-1302) 17,554 (6,643-49,417) 10,011 (1,746-35,899)

68,012 (28,571-1,78,830) 8,376 (1442-42,935)

Note: The ages were computed using different BATWING runs and the results need to be
interpreted with great caution.


4.6: Nattukottai Chettiar- A case study on caste formation 

In terms of genetics, caste can be defined as an inbreeding unit. South India is
characterized by a rigid caste system and endogamy. The formation of these caste
units may not be uniform across India. To investigate the formation of caste
based on the Y chromosome, I studied the NRY profile of the Nattukottai Chettiar (NC)
community. The ethnographic notes are presented in Table 1e. The
community is divided into 9 patrilineal clans, or Kovils, and sub-clans. Each clan is
exogamous, but the community practises caste endogamy. Fig 61 shows the distribution of
the various Kovils of NC in the Chettinad region of Tamil Nadu.


4.6.1 NRY HG frequency distribution in various sub-clans of NC 


Tables 35 and 36 show the Y HG frequencies and Fisher's p values by clan
and sub-clan. It was interesting to note that some YHGs are specific to certain
clans only. Mathur_Arumbakur showed 100% HG F*-M89 (FET 1.E-11). HG H1a*-M82
showed 100% representation in Mathur_Uraiyur (FET 1.E-08) and Mathur* (*
refers to unclassified clan) (FET 2.E-01). HG J2a*-M410 showed 100% in
Elayathankudi_Okkur (FET 7.9E-09). HG J2b-M221/M102 was mainly localised in the
Vairavan clan. Mathur_Manalur, Elayathankudi and Surakudi mainly showed HG
L1-M27/76. HG O2a-M95 was present in Mathur_Kulathur at a frequency of 100%
(FET 3.6E-08). Similarly, HG R1a1a-M17 was represented in
Elayathankudi_Kalanivasal and Erani Kovil, whereas Mathur_Karupur and Nemam
possessed HG R2-M124. The YSNP Nei gene diversity was thus nil for many of the
clans.
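Why a clan fixed for a single haplogroup has zero gene diversity follows directly from Nei's estimator, h = n/(n-1) * (1 - sum of squared haplogroup frequencies). The sketch below applies the formula to hypothetical clan samples; the sample sizes and haplogroup labels are illustrative only.

```python
from collections import Counter


def nei_gene_diversity(haplogroups):
    """Nei's unbiased gene diversity for a list of haplogroup labels."""
    n = len(haplogroups)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    sum_sq = sum((count / n) ** 2 for count in Counter(haplogroups).values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical samples, not study data.
fixed_clan = ["F*-M89"] * 10                   # every chromosome in one HG
mixed_clan = ["R2-M124"] * 2 + ["L1-M27"] * 2  # two HGs at equal frequency
print(nei_gene_diversity(fixed_clan), nei_gene_diversity(mixed_clan))
```

With all chromosomes in one haplogroup the squared frequencies sum to 1, so the diversity is exactly zero regardless of sample size.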


4.6.2 Neighbour Joining tree computations 


NJ trees were computed based on both YSNP Fst and YSTR Rst distances
(Fig 62a, 62b). Three major clusters were observed in the NJ tree based on Fst distances.
Cluster 1 was characterised by Mathur_Karupur and Nemam, having high HG R2-M124.
Cluster 2 was characterised by Vairavan Kovil and its subclans, having high
J2b-M221/102. Cluster 3 included Erani, Pillayarpatti, Surakudi, subclans of
Mathur Kovil (Manalur and Uraiyur) and subclans of Elayathankudi Kovil
(Perumathurudayar, Kinkinikuradayar and Pattinasamy), mainly possessing HG
L1-M27/76. Illupakudi, Mathur_Arumbakur, Mathur_Kulathur and
Elayathankudi_Kalanivasal stood distinct.

Figure 62a: NJ tree based on NRY HG Fst distances for subclans of Nattukottai Chettiar

Figure 62b: NJ tree based on NRY STR Rst distances for subclans of Nattukottai
Chettiar

Note: The numbers in brackets () indicate the sample size. The branch lengths
indicate the genetic distance.

The NJ tree based on YSTR Rst distances gave a different picture, with all the
Mathur subclans clustered together at minimal distances from the cluster of
Elayathankudi subclans, Pillayarpatti, Surakudi and Illupakudi. The Vairavan subclans
formed a distinct cluster, as in the NJ tree based on YSNP Fst distances.


4.6.3 Phylogenetic Network and mismatch distribution analysis: 


In the reduced median phylogenetic network, HG F*-M89 was mainly represented in
Arumbakur, showing single step mutations. The mismatch distribution showed a
unimodal peak with an MPD of 1.5 (Fig 63a). For HG H1a*-M82, two distinct nuclei could
be identified in the network. YSTR sharing was observed between Illupakudi and
Elayathankudi_Pattinasamy in one of the nuclei. The other nucleus was
overrepresented by Mathur_Uraiyur. The mismatch distribution showed multimodal
peaks with an MPD value of 3.89 (Fig 63b). This indicates diverse sources of HG
H1a* among these populations. HG J2a*-M410 was localised in Okkur and Nemam,
separated by 8 mutations on YSTR loci. No haplotype sharing was observed between
them. The same was reflected in the mismatch distribution, which showed two distinct
peaks with an MPD of 6.77 (Fig 63c). This indicates two different sources of HG J2a* in
these populations. HG J2b-M221/102 was characterised by Vairavan Kovil subclans
(Periyavakuppu and Thayanaravakuppu). The mismatch distribution showed a
unimodal peak with an MPD of 1.30, indicating recent localized expansion within the
Vairavan clan (Fig 63d).

Figure 63: Mismatch distributions of NRY HGs among Nattukottai Chettiar

HG L1-M27/76 was predominant in Perumathuradayar, Pillayarpatti, Surakudi
and Erani Kovil, which formed a one-step-away cluster (Fig 64). Mathur_Manalur formed
another cluster with multistep mutations from the central median vector. Illupakudi
formed another distant and distinct cluster in the network. The mismatch distribution
showed a bimodal peak with an MPD of 5.44 (Fig 63e), indicating multiple recent
sources and demographic expansion of haplotypes. The central node of HG R1a1a-M17 was
mainly represented by Elayathankudi_Kalanivasal. Erani Kovil was separated from
this cluster by a one-step distance. The mismatch distribution showed a unimodal peak
with an MPD of 1.74 (Fig 63f), indicating a single source of haplotypes and expansion
among these populations.

4.6.4 K mean clustering analysis 

An attempt was also made to assign the individuals to hypothetical ancestral
populations by applying a Bayesian approach using both YSNP and YSTR information.
Iterations were run for K = 1 to 9 (number of populations) using the ‘STRUCTURE’
software (Fig 65). The clans were best segregated at K = 8. The sub-clans were structured
based on their prevalent HG composition and unique YSTR evolution within each
sub-clan, thus implying long term isolation and expansion of these clans.


4.6.5 BATWING analysis 


To explore the time depths of sub-clan differentiation from each other,
BATWING analysis was employed (Fig 66). The phylogenetic tree obtained revealed
that Erani Kovil and Pillayarpatti had a recent split 462 Ybp. This was consistent with
the fact that these two clans are referred to as ‘brotherly clans’ and their members do
intermarry. On the other hand, Nemam and Elayathankudi Okkur were outliers in
the tree, having a split time of 10,766 Ybp. This cluster diverged from the rest
4,096 Ybp. Arumbakur, having only F*, showed a split time of 12,944 Ybp from the
rest of the branches of the clan. Overall, all the clans showed a coalescence time of
14,858 Ybp. It is an enigma how such high fidelity of an NRY HG to a clan came into vogue.

DISCUSSION


5. DISCUSSION 


The present study of India covered four states: Gujarat (the Kutch peninsula and
the coastal route to India); Maharashtra and Karnataka (the upper Western Ghat
ranges along with the rain shadow hill ranges and plains); and lastly Andhra Pradesh
(the Eastern part of the Deccan, known for its Godavari and Krishna river fed
agriculture). Along with earlier studies from this laboratory (Wells et al.,
2001; Kavitha, 2008; ArunKumar et al., 2012; ArunKumar, 2012), it has given some
definite clues on the peopling of the Deccan. This might have occurred much before the
so-called dispersal of the ‘Vedic’ people from the Indo-Gangetic Doab. The earlier
studies revealed that a structured society pre-existed the introduction of the Varna
system in Tamil Nadu. This might be true even in other parts of the Deccan. I present
my observations and arguments in five different sections of discussion.

5.1 The parameters of isolations in various study regions differ 

In an attempt to decipher the peopling of the four states, which might be vital for telling
the genetic history of the whole of India, the first question posed was whether the factors
that influenced the social structure of castes and tribes, as correlated to the NRY (i.e.,
male migration), were the same in the four study states. The geophysical properties of
the four states are very different. Gujarat is a monsoon dependent, mostly dry belt that
was the gateway to the first and probably subsequent migrations of Man.
Maharashtra is the starting point of the Western Ghats and has many historical entry
landmarks. Human habitation in this terrain and in Karnataka was essentially
supported by the south-west monsoon. The Western Ghats, extending from south
Maharashtra to the Cape of India, and particularly the rain forests of Kerala, are one of
the biodiversity hotspots of plants and animals. Even today, nomadic hunter gatherers
like some Paliyans live isolated in the Aliyar-Parambikulam areas of the Western Ghats.
Andhra Pradesh is unique in terms of its perennial river fed, rice cultivating populations.
Any initial population settlements must have taken place in an environment more
suitable for human habitation, food procurement and survival. Thus the roles of
geography, subsistence, social features and languages in shaping the NRY gene pool
of the study region were evaluated, particularly by employing AMOVA. The
parameters stratifying a state (province) population were not the same for all. The
observations were:

1. The degree of genetic variation between caste and tribe was high among
Gujarat populations. Further, Northern Gujarat was composed more of caste
populations, while Southern Gujarat was composed more of tribal populations.

2. Language and geographical barriers determined the NRY composition of the
Maharashtra populations studied.

3. The Karnataka study populations, mostly from the Southern hilly districts, showed
no one to one correlation with language, geography, caste-tribe divide,
subsistence or other social characteristics.

4. The populations of Northern Andhra Pradesh studied were structured based
on their mode of subsistence. The populations of the Godavari and coastal belt
(fishermen, farming communities, warriors, Brahmins etc.) are well structured
and live in sympatric isolation.


A large number of sociologists, anthropologists and geneticists have dwelt on
deciphering whether castes and tribes were derived from a common ancestor or not.
Kivisild et al., (2003) suggested that caste and tribal populations of India may have
common origins, while another study (Cordaux, 2004) suggested different origins. The
present study on Gujarat populations found the latter to be true. All the study
populations were IE speakers (including Sindhi, Kutchi or Gujarati dialects) and hence
the language barrier was not the deciding factor. Geographically, the majority of the
caste populations were present in the Kutch belt of Gujarat, with a high proportion of
R1a1a-M17, whereas both caste and tribal populations were present in the Sourashtra
region. The majority of tribal populations were localised in the Narmada valley and
possessed more of the H clades (H1a*-M82, H2-Apt) (Table 3, Fig 9). There was a
continuum, and the north Maharashtra tribes possessed appreciable H clades. Although
geography fit well, the caste-tribe divide gave the maximum AMOVA value (Table 4)
here. The castes and tribes showed a coalescent time of ~7.4 Kya. As one travels down
below the Narmada Valley, there lie the Satpura and Gondwana hills of Maharashtra.
The Central Dravidian speaking tribes of Gondwana (Gonds and Raj Gonds) showed
overwhelming frequencies of NRY HG H1a*-M82 (~70%). The Korku, the westernmost
AA speaking tribe, live in the Satpura hills amidst other CDR language speaking tribal
groups of this region; they have maintained their language identity and also possessed
appreciable frequencies of the NRY HG O2a-M95, a marker for AA language
speakers (Reddy et al., 2007). The IE speakers localised in the Sahyadri region
showed relatively lower HG H1a*-M82 and were distinct from the Gonds and Korku
linguistically, geographically and in NRY HG composition. Hence geography and
language seem to have jointly influenced the isolation of the populations of Maharashtra.
Thangaraj et al., (2010) reported geographical barriers as an
important factor in shaping the NRY profile of the Maharashtra populations, whereas
mtDNA did not contribute to this distinction.

The Sahyadri ranges (otherwise known as the west side of the Western Ghats) of
Maharashtra extending to Karnataka were characterised by NRY HGs R1a1a-M17,
J2a*-M410 and the H clades. The majority of the populations of the Western Ghats and
coastal regions of Karnataka also showed HGs R1a1a, H1a*, R2 and J2a. But as we
proceed towards the interior and Eastern slopes of the Western Ghats, Kannada
speaking, food foraging populations such as Yerava and Jenukuruba were characterised
by high HG F* (23.1% and 21.9% respectively). The majority of tribal populations of
Tamil Nadu were also localised in these Eastern slopes of the Western Ghats (Nilgiri
hills). These food foraging populations also showed very high frequencies of HG F*
(53.25%) (ArunKumar et al., 2012). The Kannada speaking tribes of the Nilgiri hills
showed 23.62% of HG F*, a frequency similar to that of Yerava and Jenukuruba of
Karnataka. The NRY HG F* is seen sporadically in other parts of India (unpublished)
and not in other parts of the world. Consequently, the highest YSTR variance
and ASD age estimates (Table 37) of HG F*, found in Tamil Nadu (0.779; 32,700 ± 5,700)
followed by Karnataka (0.686; 27,500 ± 4,700), indicated this Western Ghat region of
Tamil Nadu and Karnataka to be the earliest settlement of these Dravidian speaking
tribal populations in India (Kavitha, 2008; ArunKumar et al., 2012).
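
As a rough sketch of how an ASD age estimate of this kind can be derived (the exact settings behind Table 37 are not restated here, so everything in the example is an assumption): the average squared distance (ASD) is taken as the mean squared difference in repeat number between each sampled haplotype and an assumed founder haplotype, averaged over loci, and dividing it by a per-locus mutation rate gives an age in generations. The founder haplotype, the mutation rate and the generation time below are hypothetical placeholders, not values from this study.

```python
def asd(haplotypes, founder):
    """Average squared distance to the founder, averaged over loci and samples."""
    n_loci = len(founder)
    total = sum(
        (hap[i] - founder[i]) ** 2
        for hap in haplotypes
        for i in range(n_loci)
    )
    return total / (len(haplotypes) * n_loci)


def asd_age_years(haplotypes, founder, mu_per_generation, years_per_generation):
    """Age estimate: ASD / mutation rate gives generations; scale to years."""
    generations = asd(haplotypes, founder) / mu_per_generation
    return generations * years_per_generation


# Hypothetical two-locus haplotypes and founder (illustration only).
sample = [(14, 12), (15, 12), (14, 13), (14, 12)]
founder = (14, 12)

# Placeholder rate and generation time; real analyses must justify both.
age = asd_age_years(sample, founder, mu_per_generation=0.001, years_per_generation=25)
print(round(age))
```

Because the age scales inversely with the assumed mutation rate, estimates obtained with different rates are not directly comparable.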

It was hypothesised by Foote, (1876) that early human habitation dating back
to the Palaeolithic was not possible in the Western Ghats due to heavy rainfall and thick
vegetation. Alternatively, it was also hypothesised that this rich vegetation zone must
have attracted early humans because of the ease of availability of resources, but this
lacks corresponding evidence (Chauhan, 2010). The present study and those from this
laboratory confirm that the Western Ghats were occupied as early as the Upper
Palaeolithic age. The presence of high frequencies of F* in the Nadar populations of
Tamil Nadu makes one wonder whether, upon agricultural expansion into
previously non-cultivated areas, the tribal populations might have shunned the
newcomers and taken shelter in more isolated, preferably forested and mountainous
ranges, thus retaining their mode of subsistence and genetic distinctiveness until the
present day. A similar hypothesis of taking shelter in the mountainous ranges was
proposed by Sanghvi and Karve, (1981) as well.

An effort was made to study the south Karnataka populations in the mountainous
ranges adjacent to the Nilgiris, particularly the surroundings of Kutta and Mangalore.
Northern Karnataka was purposefully omitted for many reasons, such as the influence
of the Vijayanagara Empire and the resultant later developments in population dynamics.
Most of the populations, except three of the tribal populations, were very diverse with
high variance, but no admixture was detected for at least the past ~4.6 Kya (Fig
29). The study populations showed no correlation with language, geography or other
social characteristics in the AMOVA analysis, and each population had its distinct
genetic legacy and expansion. For example, the age of the agriculture based populations
(Adikarnataka, Kuruba, Gowda, Mogaveera and Kodava) obtained in the present
study was ~4.6 Kya, and this correlated with the ‘ash mound tradition’ of the Southern
Neolithic age, characterised by agro-pastoral activities (Boivin et al., 2008). The major
crops cultivated during this period were native millets and pulses (Fuller, 2006).
Genetic studies by Rajkumar and Kashyap, (2004) on four populations of Karnataka
using 15 autosomal loci also did not reveal any linguistic or geography based
clustering. They also inferred from their studies that the populations either had
common ancestry or have experienced very high gene flow.

The region of Kerala studied by Kavitha (2008) is a very interesting region of
India, with a lot of population movements and long standing maritime trade with the
Roman Empire. Kerala showed even frequencies of different NRY HGs and low Rst and
Fst distances, indicating significant genetic amalgamation of various castes.
The present study, along with earlier ones from this laboratory, also suggested that the
evergreen Western Ghats were the preferred settlement sites for many ancient tribes
and the later arriving stratified castes. Several studies have suggested a common
genetic signature among distantly ranked caste populations in South India (Sanghvi et
al., 1981; Watkins et al., 2008). The NRY studies on Tamil Nadu populations showed
evidence for genetic structuring based on mode of subsistence and also suggested the
existence of social stratification in Tamil Nadu prior to the establishment of the Varna
system (Arunkumar et al., 2012). This provided a classical example of societal
formation. The genetic impact of the Varna system on pre-existing populations was
minimal. A similar picture emerged from the present studies on Andhra Pradesh: the
agricultural basin of the Godavari belt revealed a situation similar to that of Tamil
Nadu.

Previous studies based on autosomal markers, Y chromosome and mtDNA
have attempted to classify the populations of Andhra Pradesh based on social (upper,
middle and lower) and Varna (Brahmin, Kshatriya, Vaishya and Shudra)
classification. The study by Bamshad et al., (1998) showed that mtDNA (N=250) based
genetic distances correlated with social rank, whereas Y chromosome genetic
distances did not. Their study concluded that the variation in the
Y chromosome is the result of mutation and drift. The movement of females across
social rank as a result of hypergyny has resulted in social stratification. In contrast,
studies by Ramana et al., (2001) (N=204) and Cordaux, (2004) suggested heavy gene
flow among the populations in Andhra Pradesh. Another study based on autosomal
microsatellite loci (Reddy et al., 2005) with a large sample size (N=948) suggested
either recent common ancestors for the populations in Andhra Pradesh or extensive
gene flow among these populations that erased the original genetic differences. Their
results suggested a lack of significant genetic differentiation based on social
stratification. Most of the earlier studies were based on smaller sample sizes
covering a wider area. But the present study, considering a definite geography (the
Godavari belt) and a larger sample size of 774, did not observe any such heavy gene
flow for at least ~2 Kya in the majority of the populations investigated.

Conclusions drawn about the process of establishment of the Varna system and
its impact on genetic systems are contradictory. The establishment of this system has
not been uniform in India (Champakalakshmi, 2001). The present study observes that
tribal populations such as Konda Reddy and Konda Kammara have remained isolated
from their neighbouring populations in the Eastern Ghats for at least 8,238 and 5,990
years respectively, with minimal or no gene flow, predating the establishment of the
Varna system (i.e., 1.2 Kya). In contrast, the ages of the Brahmin groups correlate with
the arrival of IE speakers in South India (~3 Kya). The ages of other farming populations
such as Kapu and Yadava correlate with the cultivation of pulses and rice in the Eastern
Ghats region between the Godavari and Krishna rivers (Fuller, 2006) and with animal
domestication, respectively. This age also marks the spread of rice in south India
(Fuller, 2002). Similarly, Gujarat was one of the earliest known regions for food
producing complexes from the Harappan region (Liversage, 1989). The early Harappan
origins started in North Gujarat around 5000 years ago. This corresponds with the
cultivation of native millets in the north and in Sourashtra (Fuller, 2006). The ages of
the study populations of Gujarat and Andhra Pradesh thus correlated with the period of
millet and rice cultivation, which probably supported effective population expansion in
these regions, and the dates obtained for these expansions in the present study
reiterated the existence of agricultural societies in these regions well prior to the
Varna system itself.

It is imperative to discuss the populations of Orissa and the North Eastern
region. Most of the populations in these two regions speak Austro Asiatic languages,
with minor proportions speaking Tibeto Burmese and Central Dravidian languages.
Studies from this laboratory on a total of 2,558 samples from these two regions have
clearly shown that it was the AA language that developed along with the NRY HG
O2a-M95. Presumably this clade and language originated in the Lao region and
expanded to other parts of India (Arunkumar, Lui Hui et al., unpublished data). Thus
the presence of O2a in the Korku, an Austro Asiatic speaking tribe from Maharashtra,
marks the westernmost limit of AA speakers and in all probability, as suggested by
network and other observations, might be the result of a back migration from the Lao,
Orissa and Central India regions.

Though Lao is considered the place of origin/early settlement and
successful expansion of O2a and AA language speakers, one cannot vouch, from
the available methodologies and approach, for the exact location of the O2a mutation.
Nonetheless, many studies based on autosomal SNPs, Y chromosome and mtDNA
have suggested the origin of AA speakers to be South East Asia (Chaubey et al.,
2011). Studies on tribes of Madhya Pradesh, on the contrary, suggested that the
linguistic label does not unequivocally follow the genetic imprints (Sharma et al., 2012).
Certain other studies in fact proposed an in-situ origin of AA speakers in India and
proposed a missing link between South and South-East Asian populations (Basu et al.,
2003; Kumar et al., 2007; Reddy et al., 2007). The present observation on the extent of
distribution from Maharashtra, along with those observed earlier from this laboratory,
however presents a clear case of language and NRY HG affiliation.

Though the IE and DR languages showed clear distribution and NRY
associations, they were skewed by many populations that were either miscegenated or
language replaced to varying degrees. A classical example obtained in the present
study was the Patel and Koli castes of Gujarat, which cluster with tribes and showed
many tribal cultural characteristics, as elaborated in the respective section.

To conclude, in contrast to several studies that either included populations from
diverse geographic locations of India to interpret the genetic structure of Indian
populations, or studied populations from a specific geographical location and
extrapolated the observations to the Indian subcontinent or to north or south India, here
an attempt was made to study the genetic structure of the entire Deccan (Maharashtra,
Karnataka, Kerala, Tamil Nadu, Andhra Pradesh and Orissa) along with that of
Gujarat, the gateway to India (Table 38), by studying 59 well defined castes and 45
tribes. The study reiterated our contention that each region/state of India needs to be
considered in the context of its socio-geography and cultural characteristics so that the
real factors shaping the gene pool are understood. The attempt paid dividends by
identifying the factors responsible for the sympatric isolations observed.
While the Gujarat and Maharashtra populations were structured based on caste-tribe
divide and/or geography, the Karnataka populations showed no one to one correlation
with any of the social, language or geography parameters. The Andhra Pradesh
demes were structured similarly to the Tamil Nadu populations; the structuring
thus occurred well before the introduction of the Varna system into these regions, and
agricultural settlements, as suggested by Fuller et al., 2006, played a major role in this
structuring. While the AA speakers live in the Orissa and northeast belt, the Dravidian
speakers seem to have evolved in the Deccan. As revealed by the present study, the
H clades seem to have characterized much dispersed ancient populations and are
characteristic of the Deccan, Central India and Himachal regions, while L1 characterizes
the south Dravidian speakers. Whether the central Dravidian speakers were ancient to
the south Dravidian speakers or vice versa needs to be further investigated. The
scenario might have emerged due to different sequences of migration, isolation and
evolution of these language families and their gene pools in India. The PCA plot
depicting the above argument is shown in Fig 67. Hence the parameters that
determine the structuring of the people of the Deccan are not uniform, and one cannot
consider the whole of India as a single entity with a given preconceived template.

Table 38: NRY based AMOVA of various sub groupings for Deccan Indian Populations

Sub groupings | No of groups | SNP Fsc | SNP Fst | SNP Fct | STR Fsc | STR Fst | STR Fct
[label illegible in source; the figure 2663 appears beside this row] | [illegible] | 0.1320 | 0.2175 | 0.0985 | 0.1085 | 0.1658 | 0.0643
Sahyadri, east of western ghats, south deccan, [illegible], satpura, narmada valley, upper deccan (east [truncated in source] | [illegible] | 0.1043 | 0.1577 | 0.0597 | 0.0818 | 0.1268 | 0.0490
Sahyadri, east of western ghats, south deccan plains, central deccan plains, eastern ghats, gondwana, satpura, narmada valley, west deccan, east deccan | 10 | 0.1061 | 0.2028 | 0.1081 | 0.0852 | 0.1566 | 0.0780
Subsistence: hunter, domestication, agriculture, artisan, [illegible], warrior, brahmin (AA excluded) | [illegible] | 0.1225 | 0.1484 | 0.0296 | 0.0984 | 0.1184 | 0.0222
DR populations: [illegible] | 9 | 0.0705 | 0.1400 | 0.0747 | 0.0558 | 0.1178 | 0.0657
IE caste tribe (Parsee removed) | [values illegible in source]
IE caste SC tribe (Parsee removed) | [values illegible in source]
Based on subsistence: Br, agriculture caste, [illegible], agriculture, pastoral, SC (Parsee removed) | [illegible] | 0.0858 | 0.1491 | 0.0693 | 0.0767 | 0.1189 | 0.0457

* AA speakers were eliminated in this AMOVA analysis as they could artificially induce false positive results
* AA: Austro Asiatic
* IE: Indo European
* DR: Dravidian
* CDR: Central Dravidian
* SCDR: South Central Dravidian
1. AA speakers were eliminated to avoid the false variation caused by these populations
* Parsee and Siddi were removed as they were migrant populations from Iran and Africa respectively
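
The three fixation indices reported in Table 38 are not independent. Under the usual hierarchical AMOVA framework they satisfy (1 - Fst) = (1 - Fsc)(1 - Fct), so each printed row can be checked for internal consistency. The sketch below applies this standard identity to three SNP rows of Table 38; the function and variable names are illustrative only and do not belong to any software used in this study.

```python
def fst_from_components(fsc, fct):
    """Combine within-group (Fsc) and among-group (Fct) indices into Fst."""
    return 1.0 - (1.0 - fsc) * (1.0 - fct)


# (Fsc, Fst, Fct) triples as printed in the SNP columns of Table 38.
table38_snp_rows = [
    (0.1061, 0.2028, 0.1081),  # ten geographic groups
    (0.1225, 0.1484, 0.0296),  # subsistence groups, AA excluded
    (0.0705, 0.1400, 0.0747),  # DR populations
]


def max_discrepancy(rows):
    """Largest gap between the printed Fst and the value implied by Fsc and Fct."""
    return max(
        abs(fst - fst_from_components(fsc, fct))
        for fsc, fst, fct in rows
    )


# Any gap should be no larger than four-decimal rounding error.
print(max_discrepancy(table38_snp_rows))
```

A row whose printed Fst departs from the implied value by more than rounding error would point to a transcription problem rather than a biological signal.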


5.2: Various tribes and castes of a given region may have different origins: 


India is the second most populous country, with 1.21 billion people (2011
census). Tribes constitute 8% of the total Indian population. There are ~450 tribal
communities in India (Singh, 1992) who speak ~750 dialects (Kosambi, 1991). They
differ in their geographical distribution and display diversity in terms of
parameters such as habits, customs, beliefs, subsistence, language and
ethnicity (Bhasin, 2006). It is also believed that tribes are possibly the original
inhabitants of India. In recent years, several studies have addressed the origin and
antiquity of castes and tribes. Krithika et al., (2009) have proposed two paradigms for
explaining the linguistic and ethnic affiliation of the tribal populations:

A. The early migrants or settlers had a common origin and spoke a common
language. In the course of time, dispersing into different geographical areas,
they acquired different languages due to cultural diffusion, separation or
isolation.

B. Alternatively, diverse endogamous groups speaking different languages
settled over the same or a contiguous geographical expanse at different
times, and in due course their language may have been overlaid by an
adapted or acquired local language. Still, these diverse endogamous
groups may retain their biological identity.


A phenomenon of language shift was common in these cases. For example,
the Mushar, an AA speaking population from Uttar Pradesh (Chaubey et al., 2008),
revealed a closer genetic affinity to neighbouring AA speakers than to the IE speakers
in mtDNA and Y chromosomal analyses, but had undergone a language shift to Hindi
(IE family), thereby suggesting that such linguistic shifts may not necessarily be a
signal of rapid genetic admixture, either maternally or paternally.

Figure 67: PCA plot based on Fst distances among populations from various regions of
Deccan

Note: The populations of north and south Orissa are distinct from the other Deccan
populations, indicating distinct migratory and evolutionary patterns among these
populations as compared with other Deccan populations.

Certain tribes have historical documentation of their migration into India,
the classical example being the Siddi (N=37). Non-indigenous markers such as HGs
B-M60 (5.4%), BR-M139 (18.5%) and CR-M168 (54.1%) have been identified in this
population and are absent among all the other study populations. These HGs have been
identified in the Central Sahel in Africa (unpublished data). However, the YSNP HG
gene diversity of the Siddis was calculated to be 0.6757 ± 0.0736, similar to the other
tribes within Gujarat (Results: section 4.1). This gives a clue to the assimilation of
indigenous HGs in the place of settlement. Other genetic studies also reveal the
presence of non-African genetic markers from local Indian populations that the Siddis
may have assimilated (Ramana et al., 2001; Shah et al., 2011).
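
For readers unfamiliar with the statistic, a gene diversity value such as the one quoted above can be sketched as follows. The example assumes Nei's unbiased estimator, H = n/(n-1) * (1 - sum of squared haplogroup frequencies); the estimator actually used for this thesis is not stated in this section, the haplogroup labels and counts are invented, and the standard error is not computed.

```python
def gene_diversity(counts):
    """Nei's unbiased gene diversity from haplogroup counts."""
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two sampled chromosomes are required")
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical haplogroup counts for a sample of ten males.
toy_counts = {"HG_A": 5, "HG_B": 3, "HG_C": 2}

print(round(gene_diversity(toy_counts), 4))
```

Values near 0 indicate a sample dominated by one haplogroup, while values near 1 indicate many haplogroups at similar frequencies.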

The time of settlement of these Siddis was brought out by the ASD age
estimates: HGs B-M60 and BR-M139 of the Siddis from Gujarat were 3.7 ± 1.2 Kya and
10.4 ± 3.3 Kya, but the suggested time of entry of these populations was the 15th to 19th
century (Shah et al., 2011). The other HGs present in this population were not
statistically significant. Shah et al., (2011), studying the autosomal and uniparental
markers of Siddis (N=154) of Gujarat and Karnataka, have suggested that the Bantu
speaking populations from sub-Saharan Africa migrated toward the Indian
subcontinent with the agricultural expansion from central western Africa. Overall,
these results from both NRY and other genetic markers have confirmed the non-Indian
origin of the Siddis.

The present study on the Korku, a Maharashtra tribe and the westernmost AA
speaking population of India, showed a high frequency of HG O2a-M95, similar to the
other AA speaking populations of Orissa, with an ASD age of 9.5 ± 2.4 Kya. The
YSNP gene diversity was low (0.4397 ± 0.0983) and the Fst/Rst distances of the Korku
from its neighbouring populations were high. All these indicate the isolation of the
Korku among the Satpura hills and a distinct ancestry.

Genetic similarities between two adjacent populations speaking different languages
are not uncommon in India, in spite of the endogamous nature of most of the
populations; this might be due to fission, with one population adopting a different
language. The Kolam, speaking a CDR language, are distributed along the Satpura hills
and showed equal proportions of HGs L1-M27/76 (a marker for Dravidian speakers as
suggested by Sengupta et al., 2006), O2a-M95 (a marker for AA speakers) and R2-M124
(unique to India and distributed in many warrior populations). This was evident in
the PCA and MDS plots (Results section 4.2), and the YSNP gene diversity of the Kolam
was also high (0.8917 ± 0.0271). The Y-SNP data thus indicated a shared ancestry of
the Kolam.

Surprisingly, the frequency of HG H1a*-M82, an early and successfully expanded HG
across most of India, was ten times lower in the Kolam than in other CDR speakers such
as the Gonds of Maharashtra. One has to remember that tribes need not necessarily be
either Dravidian or AA speakers only. All the tribes of Gujarat were IE speakers, and
they were distributed along the Narmada valley and the Sahyadri region of Maharashtra.
These tribes showed high proportions of HG H1a*-M82 (>20%), a marker for early
settlers, and H2-Apt (9%).

The ethnographical details of the Gujarat and Maharashtra populations suggested
that the practice of cross-cousin marriage by certain endogamous populations (such as
Kathodia, Kotwalia and Maldhari in the present study) was similar to that of Dravidian
populations (Southworth, 2005; Trautmann, 1981). R1a1, which is the commonest HG
in IE speaking Brahmin related populations, was absent or low in these tribes. Thus
this might be a clear case of language replacement in ancient populations. Similarly,
Koraga, a SDR tribe of Karnataka inhabiting the regions of Udupi and Mangalore,
also showed 89% of HG H1a*-M82 with an ASD age estimate of 9.9 ± 3.7 Kya. This
implies that Dravidian cultural practices persisted in these areas and the language shift
could not change their cultural characteristics, a unique phenomenon of the tribals of
the world and India. The most probable explanation for the language shift in the Gujarat
and Maharashtra tribes is that the incoming population was larger in size
and the whole of the surroundings spoke the dominant (IE) language, so the local tribes
could have adopted this new language. Further in-depth study is warranted to understand
how the language shifts occurred in spite of the retention of cultural characteristics.

Genetic dissimilarities between tribes of the same language family can be
explained in many ways, but the most probable is that genetically disparate tribes
adopted a language in a new place of settlement or under the influence of invasions.
The Jenukuruba and Yerava tribes, speaking Kannada dialects (SDR language family),
live in the eastern part of the Western Ghats (Coorg) and showed a statistically
significant proportion of HG F*-M89 (>20%), with ASD estimates of 12.1 ± 4.1 and
32.3 ± 7.5 Kya respectively. Similarly, HG F*-M89 among the Telugu speaking tribe of
the Eastern Ghats (Konda Reddy) was seen at a frequency of 21.6% with an ASD of
16.4 ± 3.3 Kya. The presence of HG H1a*-M82 was also not significant in these tribes.
Konda Reddy and Konda Kammara, in contrast to the Western Ghats tribes, also showed
high frequencies of HG O2a-M95 (27% and 32.6% respectively). This HG was present
in very minimal frequencies in the other study tribes (except Korku, an AA tribe) as
well as in the Brahmin and farming related populations of Andhra Pradesh. The
BATWING phylogenetic tree revealed long term isolation of these tribes from their
neighbouring populations in the Godavari belt. HG O2a-M95 is one of the frequent
haplogroups in the regions of Orissa that lie adjacent to Andhra Pradesh. The ASD
estimates of HG O2a among these tribes were 14.6 ± 2.2 and 24.2 ± 7.2 Kya
respectively, which were similar to those observed in the AA speaking tribes of Orissa
(Table 37). Earlier studies on genetic markers and anthropometric variables also
showed the genetic similarities of the Oraon and Mal Paharia, non-Mundari speaking
groups, with other Mundari speakers. Therefore further study on the affinities of these
tribes with the Orissa populations is warranted.

Sometimes a caste with a distinct mode of subsistence, such as agriculture, may
rank and cluster with the tribal populations of the region. This was true of the Patel, an
agriculture based caste population clustering with the Koli, a population that subsists
on fishing, and other tribes, while all other Brahmin related castes cluster distinctly in
the BATWING tree. They (Patel and Koli) exhibited higher YSNP gene diversity (0.8341
and 0.6212 respectively) in contrast to other caste populations in that region, and also
showed low frequencies of R1a1a-M17 (statistically non-significant), higher
frequencies of HG H1a*-M82 (26.8% and 31.6%) and higher Fst/Rst distances in
comparison to other caste populations. The BATWING tree showed them clustering
together. Thus the caste and its name, christened by British ethnologists and
anthropologists, might be based on cultural characteristics; presumably a
population fissioned and drifted in different directions, acquiring various cultural
characteristics from the surrounding populations. This was not uncommon in the
history of Mankind. This suggests that the caste/tribe divide may be a cultural
evolution depending particularly on the mode of subsistence and the opportunities
available to diverge in occupation and eke out a living. Here the
in-migration of culturally different populations, such as IE language speakers,
seems to have played a major role.

5.2.1: Not all the Brahmin populations had common ancestry:

It is generally believed that ‘Brahmin’ is a ‘large’ word embracing a large
entity of putative Central Asian migrants, or migrants originating in the Hindu Kush
ranges, who settled during the late Harappan phase and spread to the Indo-Gangetic
Doab and other parts of India. The study by Arunkumar (2012) has suggested an in-situ
origin of R1a1 in India, and the spread of IE languages to India as a result of cultural
diffusion rather than a genetic one. The present study, comprising many Brahmin
populations, together with those available in the literature, gave a chance to interpret
these in terms of the NRY and in the light of male mediated migration.

The four states studied included many Brahmin populations; not all of them
showed a similar NRY HG composition. Brahmins of Gujarat showed high proportions
of HG R1a1a-M17 with an ASD age of ~5 Kya, and in the phylogenetic network (Fig
14b) the Brahmin-related IE-speaking populations clustered distinctly from other
populations and tribal populations. A similar pattern of distinct clustering of
IE-speaking Brahmin-related (including Parsee and Maratha) and tribal populations
was observed in the Maharashtra populations as well.

Karve's (1961) work also indicated that each of the different Brahmin castes
(Chitpavan, Sarasvat, etc.) in Maharashtra probably has a different origin. That
Parsees show close affinity with Brahmin Desastha and Brahmin Chitpavan in the NJ
tree, but not in BATWING, raises a question about their origin and the period of
isolation and inbreeding they have experienced. According to the Qissa-i Sanjan,
Parsees are thought to have migrated from Khorasan (ancient Parthia) to avoid
persecution by Arabs. MtDNA analysis showed a very high frequency of haplogroup M
among Parsees (55%), similar to that of Indian populations and much higher than that
of a combined Iranian sample (1.7%), highlighting the derivation of their maternal
component from autochthonous Indian mtDNAs. McElreavey and Quintana-Murci (2005)
reported an admixture estimate of 100% from India. Qamar et al. (2002) suggested an
Iranian origin based on their Y chromosome analysis (N=90). The present study showed
that Parsees (N=86) possessed a high proportion (statistically significant) of HG
J2a*-M410 (33.7%) with a higher ASD age of 30.4 ± 7.6 Kya as compared to all other
study populations (Appendix 11). This HG has been suggested to have an exogenous
origin (Sengupta et al., 2006).

The story of the Brahmin populations in Karnataka and Andhra Pradesh was
quite diverse, as evident in the PCA, MDS and BATWING phylogenetic trees in these
states. The BATWING tree on NRY data on Havyaka Brahmins reveals that they had
remained isolated for at least 5-7 Kya from other Brahmin populations of Karnataka,
such as Goud Saraswath and Iyengar, and that they cluster interspersed with
populations of diverse modes of subsistence and cultural characteristics.

The oral migratory history of the Havyaka Brahmins states that 32 families
migrated from Ahicchatra (in present-day Uttar Pradesh) into the Banavasi region of
Karnataka only during 345-360 AD, under Kadamba rule. Painted Grey Ware pottery was
also first found at Ahicchatra in the Bareilly district of Uttar Pradesh (Ghosh and
Panigrahi, 1946). Presumably they diverged and stayed isolated from other Brahmin
groups way back in history, before arriving in Karnataka. The oral history of the
Saraswath Brahmins holds that they migrated from the Saraswathi river basin; the
BATWING analysis gave an age estimate of 4,604 years. Similarly, the Brahmin
populations of Andhra Pradesh were also very diverse. Brahmin Dravida are marked by
high proportions of HGs J2a*-M410 (24%) and G-M201 (27%), whereas Brahmin ANV was
marked by a significant proportion of HG R1a1a-M17 (55%). It is interesting to note
that these two Brahmin populations did not cluster with each other; rather they
showed closeness with other agricultural populations of Andhra Pradesh. Thus it is
possible that the Brahmin populations either differentiated in India or had diverse
origins.

The present study thus proposes that the different caste and tribal populations
of the study states in particular, and of the whole of India in general, could have
been derived from a common gene pool or from different origins and migratory
patterns; nonetheless one cannot categorically say that tribes were all different
from castes. Long-term isolation and expansion have been observed in both the tribal
and caste populations studied. Language cannot be used as a defining criterion or
proxy in these study states. The statement of Karve (1961), “it is not generally
realised that the caste society in a sense was a very elastic society”, has not been
appreciated by many recent workers, who considered these two as watertight
compartments. A caste bearing the same name may have very different origins in
different geographical regions. There are examples in which a tribe dispersed over a
large geographical region, took up different occupations in different sub-regions,
and “fitted” itself into the caste hierarchy on different rungs. Similarly,
different castes may have different origins. Thus, the origin of caste populations
may not be uniform over the entire Indian geographical space.

5.3 The Distribution of L1 and the story of Dravidian: 
5.3.1: The Dravidian: 

The story of Dravidian is a great enigma that has long defied a definite answer.
‘Dravidian people’ or ‘peoples’ are terms used to refer to the diverse groups of
people who natively speak languages belonging to the Dravidian language family.
Populations of speakers of around 220 million are found mostly in Southern India.
Other Dravidian people are found in parts of central India, Sri Lanka, Bangladesh
and Nepal. The Dravidian language family consists of 85 languages and is spoken by
about 217 million people. The most populous Dravidian people are
the Tamils, Telugus, Kannadigas, and the Malayalis. Smaller Dravidian communities
with 1-5 million speakers are the Tuluvas, Gonds and Brahui (Krishnamurti, 2003).

Dravidian languages are native to India, and epigraphically the Dravidian
languages have been attested since the 6th century BCE. Only two Dravidian
languages are exclusively spoken outside India: Brahui and Dhangar, a dialect
of Kurukh. Dravidian place names (onomastics) have been studied in modern
times, and clusters of Dravidian place names have been found along the northwest
coast in Maharashtra and Gujarat and, to a lesser extent, in Sindh, including the
Indus valley settlements, and Pakistan (Balakrishnan, 1993). Dravidian grammatical
influences such as clusivity are found in Marathi, Gujarati, Marwari and, to a
lesser extent, Sindhi, suggesting that Dravidian languages must once have been
spoken more widely across the Indian subcontinent. For this reason the present study
included Maharashtra and Gujarat as well in the attempt to identify the genetic
similarities of these people.

While a number of earlier anthropologists held the view that the Dravidian
people together were a distinct race, a number of recent genetic studies based on
uniparental markers have challenged this view. Although in modern times the majority
of the Dravidian speakers occupy the Southern peninsula (the Deccan), nothing is
definite about their ancient distribution (Fig 68). However, it is well established
that the various Dravidian language speakers must have been widespread throughout
India in ancient times, as supported by the presence of language isolates throughout
India even today. The pattern of distribution of IE language speakers and their NRY
chromosomal distributions (Arunkumar et al., 2012; Sharma et al., 2009) makes one
seriously consider whether it was a very successful expansion of the IE speakers in
the Indo-Gangetic doab that pushed the Central Dravidian speakers to the forests of
the Madhya Pradesh and Orissa belt.

Figure 68. Map of the Dravidian and Munda languages. (From Trautmann 1981:10)

There have also been many hypotheses on the origin of the Dravidian language
itself. Based on the Nostratic hypothesis, Dravidian has been suggested to be
akin to Proto-Elamite, which was spoken in the Fertile Crescent. It has been
proposed that speakers of this language spread eastwards towards the Indus region
along with farming technology (McAlpin, 1981; Cavalli-Sforza, 1996; Renfrew, 1996).
The Neolithic settlement (~6,500 years) at Mehrgarh (south-west Pakistan), showing a
continuum of artefacts in its stratigraphy for about 4 Kya and evidence of barley
cultivation and agro-pastoralism by this sedentary people, is suggestive of a
Dravidian cultural element (Kochar, 2001). Earlier studies have suggested demic
expansion as a cause of the dispersal of many ancient populations, and the eastward
dispersal of Dravidian is attributed to such an expansion (Renfrew, 1996). Such a
dispersal is supported by the mtDNA and Y chromosomal similarity of the Brahui,
North Dravidian language speakers in Pakistan, to Middle Eastern populations
(Krishnamurti, 2003; McElreavey and Quintana-Murci, 2005). It is possible that
Dravidian languages were an ancient ‘lingua franca’ of a wider area from the Fertile
Crescent to India (Pitchappan, 2002). The present-day existence of Dravidian
languages in India, however, unequivocally points at least to India as having
nurtured this language. It has long been of interest to identify the people who were
responsible for this language, or at least the land which first gave birth to and
nurtured it. In this case, language can very much be equated to culture.

5.3.2. NRY HG L1 chromosomal evidences.

With this background, I now interpret the results obtained in the present study
in the light of those available in the literature on NRY as well as other genetic
markers. The study by Qamar et al. (2002) identified HG L as one of the common
haplogroups in populations of Pakistan (14%), with the exceptions of Hazara and
Kashmiris. The admixture analysis showed a non-Jewish origin of HG L, with caution
advised on account of low sample sizes. A similar analysis also eliminated the
possibility of a non-Syrian origin of HG L among the IE-speaking Baluch. Their study
suggested a Neolithic origin of HG L that might be associated with the local
expansion of farmers (TMRCA: ~7,000 years; 95% CI: 4,000-14,000). Cordaux's (2004)
study on Indian samples suggested that a package of HGs (J2, R1a, R2 and L) was of
non-Indian origin, as they were present in higher frequencies in caste populations.

The Y chromosome phylogenetic tree of HG L (2008) is given in Fig 69. 

HG L1, a subgroup of HG L, has been associated with Dravidian languages
(Sengupta et al., 2006). This was essentially based on the microsatellite variance
of HG L1, which was high in south India compared to the Indus region. This study
indicated the possibility of early diversification in Dravidian speakers and
subsequent expansion towards peripheral regions, thus supporting an indigenous
origin of HG L1 during the early Holocene (~9 Kya).

The origin and distribution of NRY HG L and its descendants are crucial for
defining the Dravidian question. HG L*-M11 has been identified in Turkey
(Cinnioglu et al., 2004). YSTR variance in Armenians was found to be 0.41 (N=22),
with an age of 14.6 Kya and a beta mean of 26.3 Kya. These estimates overlap with
those of the Indian HG L1 ages (N=376). But the Armenian and Turkish populations do
not show the M27 mutation, a defining mutation of HG L1 that is characteristic of
Indian and Pakistani lineages (Cinnioglu et al., 2004). Armenian YSTR lineages match
most of their Turkish counterparts. This provides the clue for the absence of HG
L1-M27 in Armenians. The study also eliminated the possibility of Syria (HG L:
3.98%) or Pakistan (HG L: 9%) being the origin of HG L1, but it was limited
by a smaller sample size. The North Dravidian speakers, the Brahui, showed HG L1 in
only 1/25 samples (Sengupta et al., 2006). Hence the question of whether the origin
of HG L1 is associated with Dravidian in south India demanded further exploration.

Figure 69: Y chromosome phylogenetic tree of HG L with its defining SNP

In the present study, the north Indian populations (yellow cluster in Fig 40)
showed a small effective population size (351; 95% CI: 133-1,001), whereas the
Deccan showed 2,888 (95% CI: 1,133-11,760). The TMRCA of North India was lower
(~22 Kya) than that of the Deccan (~74 Kya), with marginally overlapping confidence
intervals. But the population expansion times of 16,975 (95% CI: 1,575-31,025) in
the Deccan and 15,725 (95% CI: 9,275-26,875) in the North Indian regions (Table 22)
indicated a dispersal of HG L1 from the Deccan to the North Indian regions. Further,
within the Deccan, HG L1 expanded in SDR speakers, who showed an average of 6.57% of
HG L1, compared to Central Dravidian (CDR) speakers with only 1.52% of HG L1, i.e.
four-fold higher in SDR than in CDR.

Further stratification of the data by linguistic states, the age estimates
and other statistical analyses narrowed down Karnataka or Tamil Nadu as the most
probable region of expansion of HG L1. The expansion time of HG L1 was higher in
Tamil Nadu (17 Kya) than in Karnataka (10.5 Kya), though with overlapping
confidence intervals. The YSTR variance and ASD estimates were all higher in Tamil
Nadu. But the effective ancestral population size (Na) of Tamil Nadu was lower than
that of Karnataka: this suggested a smaller founder population expanding to a
greater extent in the Tamil Nadu populations, and possible in-migrations from nearby
regions in Karnataka. Of the three subtypes of L tested, HG L3*-M357 was seen
at minimal frequencies in Tamil Nadu in the present study, whereas L2-M317 was
absent. Further identification of sub-haplogroups of L1 would resolve whether Tamil
Nadu could be the origin, or at least the earliest settlement and site of successful
expansion, of L1.

To decipher the exact region of origin of an HG, several factors need to be
taken into consideration, such as haplogroup frequency, a significant Fisher's p
value, accumulated YSTR diversity, high age, high Na and the presence of the other
HG subgroups within the same geographical area. As it stands, the data from the
present study qualify Tamil Nadu, with all the above-said features, as a candidate
for the origin of L1 in India.
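
One of the criteria listed above, a significant Fisher's p value for haplogroup frequency in a region, can be sketched as follows. This is only an illustration of a two-sided Fisher's exact test on a 2x2 table (carriers and non-carriers, inside and outside a region); the exact test settings used in the study are not stated, and the counts below are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    Rows: region of interest / all other regions (hypothetical layout).
    Columns: carriers of the haplogroup / non-carriers.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers falling in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts, not data from the present study.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"{p_value:.5f}")
```

A small p value only indicates that the haplogroup is unevenly distributed between the two groups; on its own it says nothing about where the haplogroup arose, which is why it is combined with the diversity and age criteria.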

Interestingly, one of the samples studied from India showed L*-M20, the
parent of clade L1. L*-M20 has been identified in Lebanese and Arabian
populations, but at very low frequencies (0.051 and 0.018 respectively) (Zalloua et
al., 2008). The other two derivatives of L*, viz. L2 and L3, are seen in the world,
but were presumably not as successful in expansion as L1. East Caucasus populations
possessed NRY HG L2-M317 (<3%), and it was found at 1% in Tajiks (Haber et al.,
2012). HG L2 in India was present at a frequency of 0.08%. This HG is also referred
to as the Mediterranean haplogroup. In contrast, L3 has been identified in the
present study in many Northern Indian populations. Two successful interconnected
8 STR lineages emerged, one expanding lineage comprising all the Northern Indian
samples studied and the other comprising all the samples of the Southern Indian
populations, as well as samples from Pakistan, Afghanistan and the East Caucasus,
but with these latter samples appearing in the terminal branches of the network
(Fig 46).


The TMRCA of HG L1 dates to the Late Pleistocene (27,524; 95% CI: 19,473-41,808),
with expansion occurring during the early Neolithic in the majority of the study
regions. This is in agreement with other studies (Thangaraj et al., 2010). The
Neolithic in south India was predominantly agro-pastoral. Therefore it is probable
that the spread of HG L1 in the SDR speakers of south India was mediated by
farming.

5.4: The HG L3*-M357: Brokpa and their ‘Aryan’ claim: 

As per the NRY HG phylogeny, the L clade (M20) is divided into three subclades,
of which L1 is common in India (present study), whereas L2 is distributed in Turkey,
and the most enigmatic and sparse of these is L3. HG L3* is defined by the M357 SNP
mutation and is a sub-haplogroup of HG L-M20 (Fig 69).

Each one of these mutations has a distinct geographical affiliation and polarity
of spread (Sengupta et al., 2006). HG L3*-M357 is present in Afghanistan and
Pakistan (7.4% and 6.8% respectively) (Lacau et al., 2012; Abu-Amero et al., 2009),
and much lower frequencies (0.6% each) were identified in SAR, Iran and UAR
(Abu-Amero et al., 2009). An earlier study from this laboratory suggested an
external origin of HG L3*-M357, probably due to recent gene flow from western
Eurasia (Arunkumar et al., 2012). Hence in the present study an attempt was made to
elucidate the migratory pattern of HG L3* in India. All the L3* samples obtained in
The Genographic Project, in addition to my own investigations on the 4 states
explored, were thus considered, along with those in the literature, to address the
question of the origin and dispersal of HG L3*-M357.

In the present Genographic study, L3* samples have been identified
sporadically in various regions of India (Fig 44a). The pattern of distribution and
the various statistical analyses suggested two distinct migration pathways for HG
L3*-M357. The first was the movement of people from Afghanistan to the Jammu region
via Pakistan, presumably mediated by the Dardic (IE language family) speakers. The
Brokpa (Buddhist and Islam) populations, subsisting on pastoralist activities in the
Dha Hanu region of Jammu, show a higher frequency of L3*-M357 (69% and 38%
respectively). The oral migratory history of the Brokpa claims that their ancestors
moved to the Dha Hanu villages from the Gilgit region that borders Pakistan and
Afghanistan. The dating of this movement, based on HG L3* in the BATWING
phylogenetic tree, shows a time depth of ~18 Ky, and since then the populations of
Jammu have remained isolated from the rest of their neighbours. The L3*-M357
chromosomes of 6 populations of Himachal Pradesh, 3 of Punjab and 10 of Rajasthan
have been separated from these Jammu populations for at least 14 Kya, as deciphered
in the network and BATWING tree of the present study (Fig 46, 48). The network
clearly showed the dispersal of this branch of L3* STRs along with the IE-speaking
populations of the northern and western Indian states studied under The Genographic
Project (unpublished).

Here we need to address the claim of the Brokpa to be ‘Pure Aryan’, which
has attracted the attention of the world and many European visitors. This raises a
question on the identity of the ‘Aryans’ themselves: whether it was the putative
hegemonic Central Asian gene pool and its spread, or whether there was no such thing
as an ‘Aryan’ race or gene pool, as propounded of late by many historians of India
(Thappar, 1990). The predominant presence of L3* in the Brokpa and its sporadic
presence in Pakistan, but reasonable and widespread presence in isolated populations
of Northern and Southern India, both IE-speaking and DR-speaking (see below),
suggest an early dispersal, though an unsuccessful expansion, of this clade in
India. Essentially this argument negates the hegemonic claim of the Brokpa as such.
I hasten to add that the present-day consensus that there was no Aryan race and no
single invasion by a Central Asian population, i.e. that the ‘Aryan invasion’ is a
myth, is in consonance with the findings of the present study.

Interestingly, this RM network showed that the node with radiating reticulation
was comprised only of Dravidian-speaking populations of the four states of the
Deccan (Karnataka, Kerala, Tamil Nadu and Andhra Pradesh) and Orissa. This might be
the second migration of L3* populations, into the Deccan in relatively recent times,
i.e. ~7 Kya (as shown by BATWING), which marks the beginnings of settled life on the
banks of the Indus river at Mehrgarh (Gupta, 2004). A very recent split between
Afghanistan and Andhra Pradesh (595 years ago) is consistent with historical events
such as the decline of the Kakatiya dynasty and the emergence of military powers,
the invaders presumably bringing this HG into Andhra Pradesh straight from
Afghanistan. Mohyuddin et al. (2006b) identified a new SNP mutation (PK3) specific
to the Kalash population, who reside in the remote mountains of the Hindu Kush
ranges in North Pakistan. This population clustered with the Yadavas of South India.
The Yadavas of Andhra Pradesh showed a frequency of 3% for HG L3*-M357. Considering
the Yadavas, one needs to be reminded of the Neolithic cattle keepers of the Deccan,
and of the story of Lord Krishna and his Yadhu tribe that presumably ruled a vast
expanse of India from Dwaraka. However, the Kalash stood distinctly apart in the MDS
plot from all other Dravidian and IE-speaking populations studied, indicating their
distinctiveness.

The present study revealed an ASD age estimate and YSTR variance of HG
L3*-M357 that were higher in Afghanistan (15,200 ± 4,400 and 0.31 respectively) as
compared to other study populations. Afghanistan has been one of the important
crossroads for human migrations, and an important stop along the Silk Road in
ancient days (Lacau et al., 2012). It was also one of the earliest known regions for
the domestication of wheat/barley, sheep, goat and cattle during the Neolithic age.
Further, earlier NRY studies have suggested gene flow of HGs such as L-M20, H-M69
and R2-M124 between Afghanistan and India, mediated by IE (Dari or Pashto) speaking
populations such as Pashtuns (Pathans) and Tajiks (Lacau et al., 2012).
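
The two quantities compared above, YSTR variance and the ASD (average squared distance) age, can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the founder haplotype is taken as the per-locus median, and the default rate of 6.9e-4 per locus per 25-year generation (the commonly used evolutionary effective rate) is an assumption that may differ from the rate applied here. All function names and the toy haplotypes are hypothetical.

```python
from statistics import median, variance


def ystr_variance(haplotypes):
    """Mean over loci of the sample variance in repeat number."""
    loci = list(zip(*haplotypes))
    return sum(variance(locus) for locus in loci) / len(loci)


def asd_to_founder(haplotypes, founder=None):
    """Average squared distance between each chromosome and the founder.

    If no founder haplotype is supplied, the per-locus median is used
    (an assumption made for this sketch).
    """
    loci = list(zip(*haplotypes))
    if founder is None:
        founder = [median(locus) for locus in loci]
    total = sum((allele - f) ** 2
                for locus, f in zip(loci, founder)
                for allele in locus)
    return total / (len(loci) * len(haplotypes))


def asd_age_years(asd, mu_per_generation=6.9e-4, generation_years=25):
    """Convert ASD to an age in years: (ASD / mutation rate) * generation time."""
    return asd / mu_per_generation * generation_years


# Toy two-locus haplotypes (repeat counts); illustration only.
toy = [(14, 12), (15, 12), (14, 13), (14, 12)]
asd = asd_to_founder(toy)
print(ystr_variance(toy), asd, round(asd_age_years(asd)))
```

Because the age scales linearly with ASD and inversely with the assumed mutation rate, estimates from different studies are comparable only when the same rate and generation time were used.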

The global network of L3*-M357, using 9 STRs of the present study and those
available in the literature, revealed further interesting points. Once again, the
Indo-European language speakers of India clustered distinctly, with a few East
Caucasus samples in the distant peripheries, while most of the Caucasus Chechen,
Pakistan and Afghanistan samples shared haplotypes and clustered with many
Dravidian-speaking populations of the Southern Indian states studied. The scenario
suggested an evolution of L and L3*s in a common expansion across these regions from
the Caucasus to Pakistan, but the composition of the median HTs and the presence of
the Chechen samples of the Caucasus suggested a later arrival to this region from
the India/Pakistan regions.

Pathans, living south of the Hindu Kush mountains, constitute nearly
42% of the total population of Afghanistan and 15% of that of Pakistan (Lacau et
al., 2012). Afghanistan showed a total frequency of 9% of L3*-M357. The Pathans have
been attributed a Jewish history by Ahmad (1952), and Greek or Rajput ancestry by
Bellew (1979) and Caroe (1958).

Haber et al. (2012) suggested, based on MDS and barrier analysis, that the
genetic affinity and gene flow between Afghanistan and north and west India were due
to interactions that could have existed since the establishment of the region's
first civilizations at the Indus Valley and the Bactria-Margiana Archaeological
Complex. This Afghan-Indian population structure excluded Hazaras, Uzbeks and South
Indian Dravidian speakers. But the present study, with respect to HG L3*-M357 only,
suggested an alternative hypothesis: the founder of L3*-M357, and maybe of the Ls,
arose in Afghanistan or Pakistan, west of the Indus barrier, with two routes of
entry into India at two different time periods, an ancient one to the north through
the passes in the Hindu Kush ranges and a much later one to the south, presumably
through the coastal route.

Thus, in the light of the fact that the two nuclei of HG L3* show different
directions of YSTR evolution, it can be proposed that the peopling of the Deccan and
of Northern India was not uniform, and that population movements from north to south
or vice versa were also scarce during ancient times. Higher-resolution studies on
the present cohort, employing newer markers and whole genome scans, will throw
further light on the conclusions made from the data set available here. We may also
need a higher coverage of samples from the Pakistan and Afghanistan regions to
decipher a more accurate pattern of the migration and peopling of India and the
Deccan through the Ls and other clades.

5.5 The story of NRY HG Js as a marker for agricultural expansion: 

The Indian subcontinent has a long history of agriculture. Wheat, barley
and jujube were domesticated as early as 9000 BCE; domestication of sheep and goat
soon followed (Gupta, 2004) and continued in the Mehrgarh culture by 8000-6000 BCE
(Baber, 1996; Harris et al., 1996). By the 5th millennium BCE agricultural
communities had become widespread in Kashmir, and cotton was cultivated by the 5th
millennium BCE-4th millennium BCE. Archaeological evidence indicates that rice was a
part of the Indian diet by 8000 BCE (Nine et al., 2005). The Encyclopedia Britannica
indicates that a number of cultures have evidence of early rice cultivation,
including China, India and Southeast Asia. Moreover, irrigation technology was
developed in the Indus Valley Civilization by around 4500 BCE (Rodda and Ubertini,
2004). Archaeological evidence has revealed an animal-drawn plough dating back to
2500 BCE in the Indus Valley Civilization (Lal, 2001).

Cavalli-Sforza et al. (1996), on the other hand, proposed that agriculture
developed in the Fertile Crescent, extending from Israel through Northern Syria to
Western Iran, about 15,000-10,000 Ybp. The mechanism of its eastward expansion was
demic (human migration) (Cavalli-Sforza et al., 1996; Renfrew, 1988). It is
hypothesized that proto-Dravidian, akin to the proto-Elamite first spoken in the
Fertile Crescent, was thus carried with the demic spread and entered India, and that
subsequent migrations of Indo-European language speakers, the pastoral nomads from
the Central Asian steppes, presumably replaced the Dravidian language speakers in
Western India. The Anatolian theory, on the other hand, claims that IE languages
spread from Anatolia (present-day Turkey) with agriculture ~8000-9500 years ago
(Gray and Atkinson, 2003; Bouckaert et al., 2012). What the genetic markers of these
people were, and their mode of dispersal, are thus still debatable. Central Asian
pastoralism, the mounting of the horse, the discovery of the wheel, and agriculture
and its expansion are considered important milestones in the dispersal of our
species.

The clinal patterns of the haploid NRY genome, the origin of agriculture and
demic expansion in Europe have been explored in many studies (Cavalli 1994; Rosser
et al., 2000). The J clade is present at appreciable frequencies in Europe,
Anatolia, the Middle East, the Indus valley, southern India and Algeria in Africa
(Hammer et al., 2000). The highest frequencies of HG J are found in many populations
of the Middle East, Iran and Algeria. The Caucasus-Anatolia and European populations
have moderate frequencies (Quintana-Murci et al., 2001). The age of this haplogroup
is 14,800 ± 9,700 YBP (Hammer et al., 2000), while in southwestern Iran the age was
5,500-17,400 YBP (Quintana-Murci et al., 2001), suggestive of a dispersal of this
clade with the eastward agricultural expansion. This age estimate is similar to the
ages calculated for the North Indian populations (5,200; 95% CI: 3,000-9,500)
(Mukherjee et al., 2001); Quintana-Murci et al. (2001) thus suggested that this
haplogroup may have been brought into India by Indo-European speakers from the
Middle East. Cordaux (2004), on the other hand, suggested a Central Asian origin
rather than a west Asian one. The studies on India were based on smaller sample
sizes (89 J chromosomes from 4 north Indian populations by Mukherjee et al. (2001),
155 samples from 9 tribes and 1 caste in Cordaux (2004), and 7 from Gujarat by
Quintana-Murci et al. (2001)); they showed a sporadic distribution of the J clade
that does not conform to any particular state or linguistic group. Hence the
question of its dispersal in India was addressed in the present study.

The present study thus investigated in depth the subtypes of the J clade, viz.
J2-M172, in 66 castes and 32 tribes from all over India under The Genographic
Project. Thus 658 J2a*-M410 chromosomes and 532 J2b-M221/102 chromosomes were
studied for J2-M172 (designated as per the ISOGG phylogenetic nomenclature, 2008)
(Fig 70) and its subtypes, i.e. NRY HG J2a-M410, J2b-M221/102, J2a4a-M47 and
J2a4c-M68; their distribution in India warranted a new interpretation.

J2a showed a very interesting pattern of maximum diversity and
frequency in the Karnataka region. The 17 STR RM network showed no median
samples, but most of the Southern Indian samples lay close to the hypothetical
median and the Northern Indian samples at the periphery of the network, implying an
expansion from south to north. The 8 STR global network, including J2a of the Middle
East and other countries, showed a median constituted mostly by Lebanese samples,
with the radiating branches consisting of samples from many countries and the other
half mostly of northern Indian samples; this was reflected very clearly in the MDS
plot. This indicated two different evolutions of this clade, with an early northern
Indian founder and another southern Indian founder. The NJ tree also showed the
Northern Indian samples, irrespective of region and whether they were Brahmin or
Sikh or Rajput or Bill or Mourya or Himachalli etc., all clustering distinctly with
a much younger split, compared to the much more ancient branch of international
samples studied, strewn with Indian, Chinese and European samples. Further, the UP
Bhumihar and Mythili Brahmin were distinct from the rest. In the older branch, the
Palestinian samples interestingly were seen all over the branch, indicating the
greatest diversity in them. This wider distribution implies a rapid dispersal: a
classical example was the clustering of Nattukottai Chettiar samples with Lebanese
and Palestinian samples in the 8 STR based network.

Figure 70: Y chromosome phylogenetic tree of HG J with its defining SNP

Majumdar et al. (2001) proposed that the Brahmin populations (showing a
higher HG J-12f2a frequency; 23.5%) had genetic contact with Aryan-speaking
groups. In India, Brahmins were the torchbearers and promoters of Aryan ritual
(Karve, 1961). But Sengupta et al. (2006) suggested that HG J2 is more predominant
in Dravidian populations than in Indo-European ones, by considering the Brahmin
populations of South India (Iyengar and Iyer) as DR speakers. The present study
shows the predominance of HG J2a*-M410 among IE speakers (6.2%). Brahmins
(20.6% of total HG J2a*-M410) of India were considered to be IE speakers in this
study irrespective of their geography or the languages they presently adopt, whereas
HG J2b-M221/102 did not show such a characteristic language affiliation (3.7% in IE,
p value = 8.E-03, and 4.5% in SDR, p value = 2.E-07). Other IE populations such as
Maldhari (58.3%, but YSTR diversity was nil) and Parsee (33.7%) also showed
appreciable proportions of HG J2a*. The interesting story of J2a-M410 was the
clustering of various Brahmin populations of India with different Indian and world
populations in the NJ tree, with the much younger branch of Himachal, Punjab &
Rajasthan populations showing a distinct and later spread. The age of J2a, ~20,000
ybp, fits well with the scenario of a rapid spread without much sedentary lifestyle,
with the lower frequencies in most of the areas implying local amalgamations and
sometimes isolation. The presence of J2a*s in the Jammu, Himachal Pradesh, Punjab,
Rajasthan belt is interesting, but no clue to their origin can be obtained. The more
recent movement of people from Iran to India was that of the Parsees (possessing
33.3% of HG J2a and 3.5% of HG J2b-M221/102), who came as refugees during the 10th
century AD (Nanavutty, 1970) through the western corridor into Gujarat and later
migrated to the Bombay province.

Earlier studies comparing western Asian and Indian populations revealed that
Indian populations showed low YSTR diversities within HG J (Quintana-Murci et
al., 2001; Nebel et al., 2002). In the present study, Palestinian populations
showed the highest YSTR variance and ASD age (0.70 and 26,872 ± 6,166 years),
whereas Lebanon and India (pooled) showed similar variance and ASD estimates
(0.53 and ~20,500 years) for HG J2a*-M410. However, the effective population
sizes, TMRCA and population expansion times suggest that the Middle East
populations were more ancient and that Indian HG J2a*-M410 is a subset of those
YSTRs. Haplotype sharing was also observed between the Lebanese and Indian
populations, indicating the presence of the same 8 STR haplotypes in India.
Together these observations suggest an exogenous origin of HG J2a*-M410 in India.
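
To illustrate how an ASD-based age of the kind quoted above can be derived, the
following minimal Python sketch computes the average squared difference of repeat
counts from a founder haplotype and converts it to years. It is only a sketch
under stated assumptions: the toy haplotypes, the choice of the per-locus median
as the founder haplotype, the mutation rate and the generation time are all
hypothetical values chosen for the example, not the data or parameters of the
present study.

```python
from statistics import median


def asd_age(haplotypes, mutation_rate, generation_years):
    """Return (ASD, age in years) for a list of equal-length STR haplotypes."""
    n_loci = len(haplotypes[0])
    # Assumed founder haplotype: the per-locus median repeat count.
    founder = [median(h[i] for h in haplotypes) for i in range(n_loci)]
    # Average squared difference from the founder, over all samples and loci.
    total = sum(
        (h[i] - founder[i]) ** 2 for h in haplotypes for i in range(n_loci)
    )
    asd = total / (len(haplotypes) * n_loci)
    # Under a stepwise mutation model ASD grows by about one mutation rate
    # per generation, so the number of generations is ASD / rate.
    generations = asd / mutation_rate
    return asd, generations * generation_years


# Hypothetical repeat counts at two loci for four samples.
toy_haplotypes = [[14, 12], [14, 12], [15, 12], [13, 12]]
asd, age = asd_age(toy_haplotypes, mutation_rate=0.001, generation_years=25)
print(f"ASD = {asd:.2f}, age = {age:.0f} years")
```

Because the age scales inversely with the assumed mutation rate, estimates are
comparable between studies only when the same rate and generation time are used.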

In HG J2b, the YSTR variances and ASD estimates were higher in Indian
populations than in the other global populations studied here. However, high
microsatellite variation can also result from repeated gene flow. This is seen in
study states such as Maharashtra, which shows high variance, but as a result of
diverse sources of YSTRs, as reflected in its mismatch distribution. No
distinction of Northern Indian populations from the rest of the samples was
observed in the global network and MDS plots, but with 17 STRs, as with J2a, the
north-western state populations were seen in the outermost layers of the RM
network. The pooled ASD age for HG J2b-M221/102 among all Indian study states
(17 YSTR) was at the higher limit, 19,241 ± 5,066 years. Sengupta et al. (2006)
reported an age of ~13 Kya for HG J2b2 and suggested the appearance of this HG in
India before agriculture. Semino et al. (2004) reported a high frequency of HG
J2b-M102 in the southern Balkans and north-central Italy and suggested a
population expansion from these regions, but Cinnioglu et al. (2004) reported
high STR variance in southwest Asia (0.33), contradicting that suggestion.
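
The mismatch distribution referred to above can be sketched as a histogram of
pairwise differences between haplotypes; several separate peaks would be
consistent with YSTRs drawn from diverse sources. The Python sketch below is
illustrative only: the toy haplotypes are hypothetical, and measuring the
difference between two haplotypes as the summed absolute difference in repeat
counts is an assumption made for this example, not necessarily the distance used
in the present study.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(haplotypes):
    """Map each pairwise distance to the number of haplotype pairs showing it."""
    counts = Counter(
        # Assumed distance: summed absolute repeat-count difference over loci.
        sum(abs(a - b) for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    )
    return dict(sorted(counts.items()))


# Hypothetical repeat counts at two loci for four samples.
toy_haplotypes = [[14, 12], [14, 12], [15, 12], [13, 12]]
print(mismatch_distribution(toy_haplotypes))
```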

Therefore, in conclusion, the present observations on a larger sample size
and comparison with international data sets suggest an exogenous origin of HG
J2a-M410 and J2b-M221/102, probably in the Middle East, with spread to various
regions of India in different ways, perhaps by different routes along the coast
and through the Hindu Kush ranges. The extent of differentiation of HG J clades
and their associated microsatellites also indicates the Middle East as their
likely homeland. In this area, J-M172 and J-M267 are equally represented and show
the highest degree of internal variation, indicating that these subclades most
likely also arose in the Middle East. The age estimates of HGs J2a-M410 and
J2b-M221/102 suggest the appearance of these HGs during the Mesolithic ages
(~20 Kya), prior to the beginning of agriculture. In India, two different
patterns of YSTR evolution have been identified among HG J2a*-M410 of north and
south Indian populations. Sengupta et al. (2006) proposed an eastward expansion
of J2a-M410 to Iraq, Iran and Central Asia coincident with painted pottery and
ceramic figurines, well documented in the Neolithic archaeological record
(Cauvin, 2000). Earlier studies have also indicated the movement of other
material culture towards the Indus. Hence the spread of HG J2 cannot be
correlated with agriculture alone. Pottery, trade and colonization without
substantial military intervention also drove wealth and technological and
cultural development. Examples of ways in which genetic migration was mediated
might include the silk and spice roads, which connected China with the Middle
East through to Europe, as well as to spice sources in India and Indonesia, and
the Incense Road, which connected India through the southern Arabian Peninsula
(Zalloua et al., 2008). The presence of J2 in Chinese populations vouches for
this.

Hence, no clue to a real agricultural expansion of the J clade could be
identified in the present study. This clade, though older, must thus have been
carried by criss-cross movements of people with the advent of trade. In general,
the later an HG appears in the phylogeny, the more rapid were its expansion and
diversification, presumably depending on the success of these populations,
empowered by each technological development.

5.6 The story of Nattukottai Chettiars and fidelity of patriliny 

The various subclans (Kovils) of the Chettiars were quite distinct from each
other, and this was reflected in the Y chromosome composition and structure
analysis. The brotherly clans ‘Erani Kovil’ and ‘Pillayarpatti’ showed a split
from their common ancestor only 462 years ago. These clans possessed high L1 and
showed isolated evolution of YSTRs within HG L1, as reflected in the network. The
other clans, Mathur_Uraiyur (HG H1a*), Surakudi (HG L1) and Illupakudi (HGs H1a*,
L1 and R1a1a), possessing mainly autochthonous Indian lineages, seem to have
amalgamated around 2000 years ago, whereas clans such as Mathur_Arumbakur (HG
F*), Nemam (HG J2a and R2) and Elayathakudi Okkur (HG J2a) showed a higher
coalescent time of 10,766 Ybp. The J2a* sources of Nemam and Okkur were
different, as revealed by network and mismatch distribution.

Considering the oral history of the Nattukottai Chettiars, that they migrated
from Chola country to their present territory, it is probable that the
populations constituting the Vairavan and Nemam Kovils, with high fidelity of HG
J2 clades, migrated from a different direction to amalgamate with the HG
L1-M27/76 carriers, who could have been the predominant pre-existing group. L1 is
known to have had a huge expansion in the Deccan, and Tamil Nadu harbours the
highest frequency and diversity of L1. In the L1 global network (Fig 40), the
Nattukottai Chettiars showed an evolution distinct from the early population of
the Deccan (see L1 section). They stood separate from the other dry land farming
populations of Tamil Nadu (ArunKumar et al., 2012), indicating their common
origin well before caste formation.

The varna system was introduced into Tamil Nadu only during the Pallava/Chola
(post-Sangam) period, though a well stratified society with professional
occupations already existed, as evidenced by various Sangam literature (Sastry,
1975). Patriliny is known in many agricultural populations and is thought to have
originated in Kazakhstan; with the advent of settled agriculture and land
holding, the pattern of male inheritance came into vogue. Dravidian society was
indeed female centric, and male hegemony with patriarchal, patrilineal
inheritance of land holdings was introduced into Tamil society only during the
later Chola period. The Adeenams and Mutts, as agricultural and spiritual
centres, were also introduced. The NC were indeed considered the descendants of
Kovalan and Kanagi, the main characters of Silapathikaram, the great Sangam epic.
The high standards of cultural evolution of the NC, with their materialistic view
of the world, within a small territory made them among the most stringent
followers of patriliny and the caste system. The multiple NRY HGs, each
restricted to a clan, speak for the fidelity of their belief system. Their
adoption of both Dravidian kinship and many Vedic rituals in fact supports
different waves of migration and amalgamation. Our recent studies showed that
societal stratification occurred 7000 Ybp and that the various population groups
did not admix during the past 300 years. This explains the occurrence of a
unique, single NRY HG lineage in many of the clans studied; thus two or three
different and distinct groups with similar value systems should have migrated and
amalgamated, perhaps in quick succession. These affinities (L1 to Dravidian
populations; J2a to Middle East populations; R1a1a to Vedic/Central Asian/North
Indian populations; and F* to ancient settlers) can be further unravelled only by
whole genome studies.

Thus caste formation among the Chettiars has been the result of multiple
migrations and settlements. The genetic data were well supported by the oral
migratory history of the Nattukottai Chettiars.


CONCLUSION


6. CONCLUSION 


The present study on the populations of Gujarat, Maharashtra, Karnataka and
Andhra Pradesh, along with the other studies on Tamil Nadu and Orissa from this
laboratory and those from the literature, has led to the following conclusions
and helped decipher the factors that determined the peopling of India. The
important findings from this study are presented below:


1. The genetic structure of the various study states of the Deccan and Gujarat
is not uniform.

2. The parameters that determine the genetic structure were not the same in the
various study states, and thus India cannot be considered a single gene pool.

3. The castes and tribes of Gujarat were distinct in their NRY genetic profile,
and the variation between them was high among Gujarat populations.

4. Language and geography were the most important isolation parameters among
the Maharashtra study populations.

5. Most of the Karnataka study populations showed high NRY HG variance and thus
no one-to-one correlation with language, geography, the caste-tribe divide,
subsistence or other social characteristics.

6. The split ages of the study populations from Karnataka obtained from BATWING
analysis indicate that the agriculture-based populations split ~4.5 Kya, which
correlates with the ‘ash mound tradition’ of the southern Neolithic culture and
cattle keepers (Fuller, 2006).

7. The study populations of the Coastal and Godavari belt of Andhra Pradesh
were structured by their mode of subsistence, with no gene flow for at least the
last ~1.7 Ky.

8. The ages of the study populations of Gujarat and Andhra Pradesh correlated
with the period of millet (~4.5 Kya) and later rice cultivation (in Andhra
Pradesh: ~2 Kya). The existence of agricultural societies in these regions fits
well with the advent of the Varna system itself (1.2 Kya).

9. Not all the tribes had a shared ancestry. Different tribes in the study
region showed different origins (as evidenced by the Siddis or Korku). Some
showed a shared ancestry with other populations.

10. NRY HG L1-M27/76 may further be qualified as a marker for South Dravidian
populations, extending the study of Sengupta et al. (2006).

11. Tamil Nadu may thus be the candidate for the origin, or early settlement and
successful expansion, of HG L1-M27/76 in India, with a small effective
population size and the maximum population expansion time (17 Kya), seeding
other regions of India. However, the hypothesis of proto-Dravidian demic
diffusion along with agriculture, as suggested by Renfrew (1988), cannot be
eliminated. The presence of the Syrian, Afghanistan and Pakistan median
haplotype and its wide distribution (at 9 YSTR resolution) in Himachal Pradesh,
coastal Gujarat and Maharashtra and all over the Deccan, but not in Uttar
Pradesh and to its east, may still suggest an origin of this clade far to the
west of India and a spread towards India.

12. The HG L3*-M357 distribution patterns suggested two alternative migratory
routes from the land of its origin in Afghanistan or Pakistan, or further west.
The two entries into India probably took place at two different time periods: an
ancient one to the north through passes in the Hindu Kush ranges ~18 Kya,
settling in Himachal Pradesh, and a much later one (~7 Kya) to the south,
presumably by the coastal route, as represented by the samples from Chechen-East
Caucasus-Dagestan, Afghanistan and the Deccan (8 STR global network).

13. Analysis of HG J2a-M410 and HG J2b-M221/102 suggests an exogenous origin of
these HGs in India, probably in the Middle East. These HGs probably spread into
India by different routes.

14. Diverse YSTR evolution of HG J2a*-M410 has been identified among the north
and south Indian populations, whereas no such pattern was observed in HG
J2b-M221/102.

15. The ASD ages of HGs J2a-M410 and J2b-M221/102 were calculated to be ~20 Kya
and did not provide any clue to a real agricultural expansion within the J clade
in the present study. The dispersal of these clades in India can also be
attributed to pottery, military intervention or refugee movements.

16. The patrilineal clan system of the Nattukottai Chettiars clearly showed the
fidelity of the Varna system in them: the strict adoption of patrilineal
inheritance of their ‘Kovil’ (Temple = clan), each clan having mostly one HG.
Their oral tradition of having resettled from Kaveripoompatinum to the present
expanse of ‘Nattukottai’ might be true. Nonetheless, the various HGs and their
dating signal that an incoming group with J clades (Vairavan Kovil and Nemam),
the traders, might have amalgamated with an earlier settler group having
preponderant L1, quite distinct, though, from the L1 present in other
populations of Tamil Nadu; the presence of R1a1 only in Kalanivasal, the
Vaizhnava worshippers of Elayathakudi, may be a later addition. The age
calculations of these clades (J2, L1 and R1a1) in the NC show their caste
formation during various periods. The presence of only F* in one clan and O2a in
another are further indicators of amalgamation of local and distant populations
at the time of caste formation or much later. The oral tradition that refers to
the Erani and Pillayarpatti clans as “brotherly” clans that do not inter-marry
was corroborated well on various counts: they occur together in the NJ tree,
share L1 haplotypes and show a young split time of 462 years in BATWING
analysis.

17. The study brought out the diversity of Brahmin populations. The various
Brahmin populations studied from different states did not form an isolated
group. In most of the analyses, they were seen mixed with other caste and tribe
populations in networks and trees. Nonetheless, various Brahmin populations
shared specific HTs with different Indian and global populations, suggesting
their affinity and origin with these people and their land; most striking among
these was the median in the J2a cluster.

Thus, the study populations reveal the different histories and distinct
genetic legacies of various populations and states. The evolutionary factors and
genetic phenomena operating on them were not the same. The structure of the
populations was laid down as demes in various regions much before they were
identified by different names and before the advent of Varna. Further analysis
with increased sample sizes from global populations and at higher resolution is
warranted. MtDNA and whole genome analysis would throw deeper insights into
population histories.


Appendix 1e: Ethnographic notes on Nattukottai Chettiar


The Nattukottai Chettiars are a notable business community of Tamil Nadu.
They have a known history of migration, settling in Pandiya country ~800 ybp in
a defined area of 500 sq km, 50-100 km east of Madurai (Chettiar, 1874). To date
they practise caste endogamy and clan (patrilineal) exogamy, each clan being
affiliated to a defined temple donated by the Pandiya kings. They are staunch
Saivites, patronizing Sanskrit and Veda Padasalas (Vedic schools). Many
‘Nayanmar’ poets and scholars of the Bhakthi movement were from among them, and
the main characters of ‘Silapadikaram’ (~2 kya), the iconic epic of the Sangam
period, which celebrates both Jain and Buddhist philosophies, were drawn from
them. The Nattukottai Chettiars have been an enterprising, seafaring merchant
community trading with the Far East since the Chola period. They have built
palatial mansions, unique summer palaces of Indo-European architecture with a
central open-air quadrangle and water harvesting technologies, and live with
pomp, great philanthropy and hospitality even today. They practise Vedic rituals
in marriages, cremate their dead, employ Brahmin priests and observe a 16-day
pollution and purity period. They have also imbibed many local ‘Dravidian’
cultural elements, such as the uncle betrothing the bride rather than the father
giving the daughter as a gift in marriage (‘kannigathan’, as practised among
Brahmins), and they celebrate the ‘puberty’ of girls.


Culturally, the population is divided into 9 patrilineal clans with their own
temples; three of them have further subtypes, again patrilineal clans for all
purposes of marriage and inheritance, thus totalling 28 clans. The names of these
kovils designate the gods and goddesses they worshipped in the place of their
original settlements. As of 2001, a total of 30,941 families lived in their 76
settlements, villages and townships. The forefathers of this caste were granted
this land by Athiveerapandiya.


Appendix — 2 Informed Consent Form 


‘GENOGRAPHIC - INDIA’ 


Madurai Kamaraj University - School of Biological Sciences 
In collaboration with NGS-IBM-The Waitt Family Foundation 


Volunteer Informed Consent form for obtaining Human tissues for Genetics Research 


1. The purpose of the Genographic study, carried out by Madurai Kamaraj
University, was explained to me in my local language;

2. I accept that I may not obtain any direct benefit from the said research
and that the results will be available in the public domain;

3. I agree that my results will be kept confidential and my identity anonymous;

4. I have the liberty to opt out and to request that you withdraw my results
and samples from the study at any time;

5. I understand that the blood / mouth wash / cheek swab collected from me will
be used for DNA based tests that may facilitate better understanding of the
human genome, migration, evolution and related genetic aspects;

6. I accept that any results arising out of the research shall be published by
the investigators for a better understanding of any genetic phenomena;

7. I volunteer to donate my blood / mouth wash / cheek swab / saliva for the
said study; I am aware of the minor discomfort of venipuncture and that I may or
may not be compensated for lost wages, discomfort etc.;

8. I agree that a portion of the samples may be stored in a repository and used
at a later date for similar non-profit genetic tests;

9. I freely and voluntarily choose to participate in the study based on the
explanation provided by the interpreter / community leader, and hereby give my
informed consent.


Wage Compensation for lost working hours / travel expenses / subsistence — 
received / not received. 


Name of the Volunteer: 
Address: 


Signature of Volunteer 


Name of Interpreter/Community Leader 


Signature of Interpreter/ 
Comm. Leader 


Name of Sampling Team Leader: 


Signature of Team Leader 


Date: 
(2008/01/3000 copies/SMS) 


Appendix 3: Details of Expeditions undertaken, Advisors and samples collected during the study period 


Advisors and |Expeditio| },,, | District of Populations N_ |Sampling [DNA 
Colloborators |n collection sampled Collecte |Team extractio 


NRY HG |NRY STR 
Genotypin |Genotypin 


1/6/08, |Tirchy, Nattukottai VJK*,AS*/
17/7/09, | Coimbatore Chettiar 171 venvent AS/VSA/

Sampling team abbreviations:
*RMP Prof RM Pitchappan
*VJK Dr V J Kavitha
*AS Adhikarla Syama
*VSA Vasanthakumari Varadarajan A
*GAK Ganeshprasad Arun Kumar

Advisors and Collaborators:
Dr Veeraju, Andhra University, Vizag
Dr Velaga Lakshmi, Dept of Genetics, Andhra University, Vizag
Dr Mahan Kali, Dept of Genetics, Andhra University, Vizag
Dr Sambasiva Rao, Dept of Anthropology, Andhra University, Vizag
Mr Vasudevan, Bangalore
Air Marshal Chengappa, Bangalore
Dr Gangadhar, Dept of Anthropology, Manasa Gangotri, Mysore
Dr Bhat, Dept of Anthropology, Manasa Gangotri, Mysore
Mr Arun Kumar Suvarna, Mangalore
Dr Agarkar, Homi Bhabha Centre for Science Education
Dr V Gambhir, Director, Sholapur Science Centre, Sholapur
Dr Uma N Rao, Director, IWSA, Mumbai
Dr Kamaloorkh Marolia, Assistant Professor, K J Somaiyya College, Mumbai
Dr Vidyanand Khandagale, Assistant Professor, Shivaji University, Kolhapur
Dr Mumtaz Baig, Reader, Amaravati University, Amaravati
Mr Ashok Vyas and Mr Ashok Mel, Agakhan NGO
Dr Krishna, University of Baroda


YHG L-27/76 and YHG J-304 subtyping for all the samples collected by Genographic India was performed by AS (self).
Tamil Nadu and Kerala samples studied by VJK were used for comparative study with my study populations.
Orissa samples studied by VJK and GAK were used for comparative study.


Appendix — 4 


GPID                                        Date of Sampling
GENOGRAPHIC - INDIA
Madurai Kamaraj University


Volunteer Enrolment Form 


Address (Permanent) 
Door No 


Location / Sampled place:

Gender (Sex):

Native Language:

Ethnicity (Caste):        Subcaste / Gotram:
Place of Birth:
Age/DOB:


Mother: Native Language
        Ethnicity
        Place of Birth


Father: Native Language
        Ethnicity
        Place of Birth

Maternal GM Ethnicity
        Place of Birth


Maternal GF Ethnicity
        Place of Birth


Paternal GM Ethnicity
        Place of Birth


Paternal GF Ethnicity
        Place of Birth


If married / wife/husband belong to the same village / nearby Km 


Whether your parents related before marriage? Y)es/N)o
If yes, U)ncle - Niece / F)irst Cousin / D)istantly related 


Sib - ship size : 1) 2) 3) 4) 5) 6) 7) 


Any other observation 


Sample: Mouth Wash


Sampled By                Date


(2008/01/3000 copies/SMS)


APPENDIX - 5 


GPID Date of Sampling 
GENOGRAPHIC INDIA 
Volunteer Enrolment Form - "Nattukottai Nagarathar" 


Address (Permanent in Chettinad) 


Street:
Town/Village 


Pincode 
State 
Others 


4  Gender (Sex): Male
5  Native language: Tamil
6  Ethnicity (Caste): Nattukottai Chettiar
   Subcaste/Gotram/Kovil
7  Place of Birth
8  Age/DOB
9  Mother: Native language: Tamil
10 Kovil
11 Place of Birth
12 Father: Native language: Tamil
13 Kovil
14 Place of Birth
15 Maternal GM Kovil
16 Place of Birth
17 Maternal GF Kovil
18 Place of Birth
19 Paternal GM Kovil
20 Place of Birth
21 Paternal GF Kovil
22 Place of Birth


Whether your parents related before marriage? Y)es/ N)o 


If Yes, U)ncle- Niece/ F)irst Cousin / D)istantly related 


Your Sib-ship size: 1) 2) 3) 4) 5) 6) 7) 


Any other observation 


Sample: Mouth wash 
Sampled By Date 


Appendix 6 
‘GENOGRAPHIC-INDIA’


Madurai Kamaraj University — School of Biological Sciences 


In collaboration with NGS-IBM-The Waitt Family Foundation 


SAMPLING TEAM 
Name of the Population: 
Name of the City/Village: 
Taluk: District: 
State: PIN 


3. Group Leader/ Village Head: (Team leader / next in command should explain the purpose, study design and about 


legacy project and obtain consent from the individual/ Clan/ Group/ HouseHold/ settlement/ village Leader or asap: Sampling to be done 


only on voluntary basis: no allurement. Compensation for loss of wages and transportation may be paid) 


Name: 
Address: 
City/Village: Taluk: 
District: INDIA / 
PIN: Tel: 
Signature: 
4. Field Work Support By: Name of NGO / personnel / Local Scientist 
Name 
Address 
PIN: Tel.No: 
Signature: 
5. Sampling Team: Team Leader: 

Doctor(s): 

Technician(s): 

Student(s): 

Field Work Asst(s): 

Other(s): 

No of Samples Collected: 

Notes/Comments: 
Date: Station:_ Signature - Team Leader 


Appendix 7 — Village Document 


Date of Sampling: 


‘GENOGRAPHIC-INDIA’


Madurai Kamaraj University — School of Biological Sciences 


In collaboration with NGS-IBM-The Waitt Family Foundation 


SAMPLING - VILLAGE DOCUMENT 


1. Name of the Population(s): 
2. Name of the Settlement: 3. Village: 
4. Taluk: 5.Dist: 6. State: 
7. Co-ordinates: Longitude: 8. Latitude:
9. Altitude: 10. Vegetation: 
11. Climatic Conditions: 
Period(months) Temp Max/Min Rel. Humidity Rain fall 
At the time of 12) ; 13/14) Celsius; 15) %; 16) cm
Sampling: 
Summer: 
Winter: 
17. Number of the Household: 18. Population size: Total: 
19.Male: 20. Female: 21. Children 


22. Subsistence mode: Foraging / Agriculture / Labour /


23. Village Economy: Foraging / Plantation Labour / Coolie / Agriculture / 

24. Nature of Housing: 25. Public Toilets:No 26. Wells:No 

27. Terrain: Plain/ Hill/ Mountain/ Slope/ Low lying/ Sea coast/ Shrub jungle/Desert/ Forest/ 
Others 

28. Provisions: Govt. Drinking Water : overhead tank / well / stream 


29. Balwadi: Y/N 30. Nearest Hospital: km 

31. Transportation: Trek / Road / Bus / Other 32. Bus frequency trips / day 
33. Electricity: Village Yes/No 34. House-supply: Yes/No 

35. Fuel : Fire wood / Kerosene / Gas 

36. Nearest shopping: km 37. Township: km 38. Shandy: km


39. Place name: 40. 41. 


42. Society: M)atriarchal / P)atriarchal 


43. No & Name of Tribes/ Castes living in the village: 
1) 2) 3) 4) 


Clans: 


44. Language spoken: 45. Written script: 


No of Educated persons in the village: 
46. Graduate 47. School final 48. Elementary 
Professional: 49. Doctor 50. Engineer 51. Agriculture 
52. Computer IT 53. Management: 54. Post Graduate 


Any Comments: 


Oral History of Origin and Migration History: 


Cultural details: Festivals celebrated: 


Chief Deities: 


Marriage rituals: 


Birth rituals: 


Puberty Rituals: 


Death Rituals: Burial / Burning / Burial site 


Chief Diet: 


Artifacts used: 


Musical Instruments used: 


Village Document filled By: Signature: 


Appendix 8 : Compositions of Solutions used for DNA extraction

Appendix 10: List of Y STRs detected using Y filer Kit


Locus | Allele Range | Dye label | Repeat Motif | Positive Control Allele
DYS456 | 13-18 | FAM [Blue] | | 15
DYS389I | 10-15 | FAM [Blue] | (TCTG)3(TCTA)n | 13
DYS390 | 18-27 | FAM [Blue] | (TCTA)2(TCTG)n |
DYS389II | 24-34 | FAM [Blue] | |
DYS458 | 14-20 | VIC [Green] | GAAA |
DYS19 | 10-19 | VIC [Green] | TAGA | 15
Y GATA H4 | 8-13 | | |
[Remaining rows, including the NED [Yellow] loci, are illegible in the source.]


List of Y STRs and indels detected in Multiplex 2 assay

Locus | Type of polymorphism | Dye label
DYS426 | |
DYS388 | |
M139 | 5G to 4G |
[Remaining rows are illegible in the source.]


Appendix 11: Average Square Differences (ASD) based age estimates among various study populations

[Table illegible in the source: ASD age estimates by population (rows) and NRY
haplogroup (columns) for the study populations of Gujarat, Maharashtra,
Karnataka, Andhra Pradesh, Tamil Nadu, N Orissa and S Orissa, with a pooled row
for each state.]

* ASD age was calculated for each HG only above a minimum sample size.
Ages are mentioned in Kya.
Values within the brackets () indicate the standard error in Kya.


Appendix 12: Population codes 


Region 
Karnataka 
East India 
Europe 
Argentina 

South Pakistan
Afghanistan 
Maharastra 
Karnataka 
Rajasthan 
Karnataka 
West India 
Bihar 
Pakistan 
Pakistan 
Rajasthan 
Uttar Pradesh 

Spain
Karnataka 
N.ORISSA 
Tamil Nadu 
Rajasthan 


East Caucasus 
Lebanon 
Lebanon 
Lebanon 

West Caucasus
Middle East 
Middle East 


Tamil Nadu Ezhava
Rajasthan
Karnataka


Near East 
Tamil Nadu 
Near East 

South India

South India
N. Africa 
Pakistan 
Assam 
Andhra Pradesh 
N. Kerala 
Andhra Pradesh 
Maharastra 
Gujarat 
Karnataka 
Gujarat 
Maharastra 
Andhra Pradesh 
Karnataka 

South India
Karnataka 
N. Africa 
Pakistan 
Andhra Pradesh 
Gujarat 
N. Kerala 
Maharastra 
Rajasthan 
Karnataka 
Middle East 
Middle East 
N. Kerala 
Tamil Nadu
N. Kerala
Tamil Nadu 
Tamil Nadu 
Tamil Nadu 
Tamil Nadu 
Tamil Nadu 
Afghanistan 


Irula
Near East


Mala


NadarCape
NC_EraniKovil 
NC_Illupakudi 
NC_Mathur/Manalur 
NC_PillayararPatti 
NC_Surakudi 
Nurestani 




Middle East Palestinian


Pallan
Afghanistan
Afghanistan Pash
Tamil Nadu Parayar Pay
Tamil Nadu Printer PK
Andhra Pradesh Raja
Rajasthan
Andhra Pradesh


Punjab SikhJatt
Pakistan
Pakistan

Spain


yrian Syrian 



Afghanistan 


Afghanistan Tak
Tamil Nadu TamJ
India
N. Kerala
East Asia
N. Africa
Turkey Turkish
East Asia Uygur
Afghanistan Uzb
Tamil Nadu Valayar
India Vanniyar
Tamil Nadu Vanniyar Van
Maharastra War
West Eurasia Weu
China
Andhra Pradesh Yad
Tamil Nadu Yad_T
Karnataka Yerava




APPENDIX - 13 


Jenu Kuruba tribe at Nagarahole forest, Karnataka, Sep 2008
Genographic sampling team at Nagarahole forest, Karnataka, Sep 2008

Yerava tribe from Kutta, Coorg, Karnataka, Sep 2008
Mogaveera volunteers at Bolar, Mangalore, Nov 2008

Mandyam Iyengar volunteer at Bangalore, Nov 2008
Warli tribe at Nareshwadi, Maharastra, Nov 2010

Korku volunteers at Gavilgad, Maharastra, Nov 2010
Student volunteers at college sampling at Gavilgad, Maharastra, Nov 2010

Kathodia, Gujarat, Oct 2010
Kotwalia, Gujarat, Oct 2010

Kutchi Brahmin, Gujarat, Nov 2010
Siddi, Gujarat, Nov 2010

Siddi, Gujarat, Nov 2010
Siddi, Gujarat, Nov 2010

Sompuri Brahmin, Gujarat, Nov 2010
Sompuri Brahmin, Gujarat, Nov 2010
Abstract


Previous studies that pooled Indian populations from a wide variety of geographical locations have reached contradictory
conclusions about the processes of the establishment of the Varna caste system and its genetic impact on the origins and
demographic histories of Indian populations. To further investigate these questions, we took advantage of the fact that both Y
chromosome and caste designation are paternally inherited, and genotyped 1,680 Y chromosomes representing 12 tribal
and 19 non-tribal (caste) endogamous populations from the predominantly Dravidian-speaking Tamil Nadu state in the 
southernmost part of India. Tribes and castes were both characterized by an overwhelming proportion of putatively Indian 
autochthonous Y-chromosomal haplogroups (H-M69, F-M89, R1a1-M17, L1-M27, R2-M124, and C5-M356; 81% combined) 
with a shared genetic heritage dating back to the late Pleistocene (10-30 Kya), suggesting that more recent Holocene 
migrations from western Eurasia contributed <20% of the male lineages. We found strong evidence for genetic structure, 
associated primarily with the current mode of subsistence. Coalescence analysis suggested that the social stratification was 
established 4-6 Kya and there was little admixture during the last 3 Kya, implying a minimal genetic impact of the Varna 
(caste) system from the historically-documented Brahmin migrations into the area. In contrast, the overall Y-chromosomal 
patterns, the time depth of population diversifications and the period of differentiation were best explained by the 
emergence of agricultural technology in South Asia. These results highlight the utility of detailed local genetic studies 
within India, without prior assumptions about the importance of Varna rank status for population grouping, to obtain new 
insights into the relative influences of past demographic events for the population structure of the whole of modern India. 
Introduction


Contemporary Indian populations exhibit a high cultural,
morphological, and linguistic diversity, as well as some of the
highest genetic diversities among continental populations after
Africa [1,2]. Indian populations are broadly classified into two
categories: ‘tribal’ and ‘non-tribal’ groups [3]. Tribal groups,
constituting 8% of the Indian population, are characterized by
traditional modes of subsistence such as hunting and gathering,
foraging and seasonal agriculture of various kinds [2,3]. In
contrast, most other Indians fall into non-tribal categories, many of 
them classified as castes under the Hindu Varna (Color caste)
system, which groups caste populations, primarily by occupation,
into Brahmin (priestly class), Kshatriya (warrior and artisan),
Vaishya (merchant), Shudra (unskilled labor) and the most recently
added fifth class, Panchama, the scheduled castes of India [2,3].
Generally, both non-tribal and tribal populations follow
patrilineal caste endogamy. This practice, together with the
male-specific genetic transmission of the non-recombining portion 
of the Y-chromosome (NRY), provides a unique opportunity to 
study the impact of historical demographic processes and the social 
structure on the gene pool of India. 

The distribution of deep-rooted Indian-specific Y-chromosomal 
and mitochondrial lineages suggests an initial settlement of 
modern humans in the subcontinent from the early out-of-Africa 
migration [4,5,6,7,8,9]. The greater genetic isolation of many 
tribal groups and their differences in Y-chromosomal haplogroup 
(HG) lineages compared to non-tribal groups, have generally been 
interpreted as evidence of tribes being direct descendants of the 
earliest Indian settlers [2,10,11,12,13]. Moreover, these tribe-caste 
genetic differences have been attributed to the establishment of the 
Hindu Varna system that has been maintained for millennia since 
both Y chromosome and caste designation are paternally 
inherited. However, the origin of the caste system in India is still a
controversial subject [8,14,15,16], and there are two main schools 
of thought about it. First, demic diffusion models propose an 
expansion of Indo-European (IE) speakers 3 Kya (thousand years 
ago) from Central Asia [10,17,18,19,20,21,22]. Alternatively, 
other models propose the origin of caste as the result of cultural 
diffusion and/or autochthonous demographic processes without 
any major genetic influx from outside India [6,7,16,23]. Overall, 
the genetic impact and mode of establishment of the caste system, 
the extent of a common indigenous Pleistocene (10 Kya to 
30 Kya) genetic heritage and the degree of admixture from West 
Eurasian Holocene (last 10 Kya) migrations and their level of impact
on the tribal and non-tribal groups from India, remain unresolved
[5,6,7,10,16].

The lack of consensus among previous studies may reflect 
difficulties associated with the conflicting relationships between 
genetics and the socio-cultural factors used to pool truly 
endogamous groups into broader categories, sometimes grouping 
Indian populations sampled from a wide variety of geographical 
locations together, such as a tribe-caste dichotomy or caste-rank 
hierarchy [2,5,7]. One goal of pooling data from multiple 
populations has been to smooth individual drift effects in an effort 
to reconstruct putative ancestry [10] and thereby potentially infer 
the past demographic processes shaping genetic diversity. How- 
ever, the success of this approach relies on whether the 
classification employed indeed reflects the true historical relation- 
ships among these endogamous groups. Methods seeking to 
identify the best grouping from an exploration of alternative 
possible classifications, based on seeking maximal between- 
population differences and minimal within-population variation 
[24], would be of special relevance for studies on Indian 
populations classified based on Varna status. This is the case 
because several castes have suffered from historically fluid 
definitions of their rank status, and both the origins and the scope 
of the genetic impact of the Varna system on these populations are 
still unclear [8,20,25,26,27,28]. Further, since the implementation 
of the Varna system throughout India was not a uniform process 
[17], broad classifications of multiple Indian samples from all over 
the subcontinent based on Varna status, or tribe-caste dichotomy, 
may not reflect true endogamous populations and could also 
obscure genetic signals and the finer details of Indian demographic
histories. For this reason, a genetic study based on careful and
extensive sampling of well-defined non-tribal and tribal endogamous
populations from a restricted area could be a successful model for
obtaining new insights into past Indian demographic processes. Such
a design reduces the confounding relationships among socio-cultural
factors and, without presuming Varna rank status, allows the best
population grouping to be found empirically.

Here, we attempted to apply this strategy to unravel the 
population structure and genetic history of the southernmost state 
of India, Tamil Nadu (TN), which is well known for its rigid caste
system [15], and to relate the resulting genetic data to the 
paleoclimatic, archaeological, and historical evidence from this 
region. The paleoclimatic and archaeological records show post- 
LGM (Last Glacial Maximum) wet period expansions of foragers 
into the region, whose interactions with later aridification-driven 
migrations of agriculturists have been traced 
[29,30,31,32,33,34,35]. Archaeology also reveals the establish- 
ment of metallurgy [36] and river settlements [17], just several 
centuries prior to the creation of the earliest written records of the 
Sangam literature (300 BCE to 300 CE). These historical records 
named several populations including some in the present study 
(e.g., Paliyan, Pulayar, Valayar) reflecting the existence of these 
now endogamous groups at that time [37,38]. More recent reports 
dated to the 6th century CE, under the reign of the Sarabhapuriyas
[39], illustrate the local implementation of the Varna system
around 1 Kya, following the arrival of Brahmins into the region
[15,17]. The Tamil epics of this period, such as the Purananuru 
anthology and Silapathikaram, describe a society with a well- 
defined occupational class structure based on subsistence practices 
[22]. Earlier genetic studies of TN populations identified clear 
differentiations of endogamous ethnic groups classified into Major 
Population Groups (MPG) based on socio-cultural characteristics 
reflecting subsistence, traditional occupation, and native language 
(mother tongue) [40,41]. Although some studies have identified hill 
tribes as the earliest settlers, and others suggested a common 
genetic signature among distantly ranked-caste populations, the 
main evolutionary and demographic processes shaping the 
observed genetic differences among populations from TN are still 
unresolved in the literature [15,42,43,44]. 

In the present study, we examined the Y-chromosomal lineages 
of 1,680 individuals sampled from 12 tribal and 19 non-tribal well- 
defined endogamous populations. We first investigated whether 
tribal and non-tribal groups shared a common genetic heritage 
and characterized the proportion of putatively autochthonous and 
non-autochthonous Indian Y-chromosomal haplogroups. It is 
important to note that the total sample size used here is higher 
than those in other studies covering the entire Indian subconti- 
nent. Further, the detailed anthropological annotation of endog- 
amous populations sampled from a restricted region within India, 
together with the paleoclimatic, archeological and historical
regional background, were all important aspects needed to reduce
the confounding relationships among socio-cultural factors. This
general approach allowed us to infer important genetic signals and
the finer details of the population demographic histories.
Therefore, we sought to determine which of the classifications 
based either on the Varna system (rank status, tribe-caste 
dichotomy), or social-cultural factors (reflecting subsistence, 
traditional customs and native language), or geography better 
indicated true endogamous groups by exhibiting higher between- 
population differences and lower within-population variation. 
Since both Y chromosome and caste designation are paternally 
inherited, we further explored whether any of these genetic 
differences could be attributed to the historical evidence for the


November 2012 | Volume 7 | Issue 11 | e50269 


establishment of the Hindu Varna system. We found instead that
the overall Y-chromosomal patterns, the time depth of population
diversifications and the period of differentiation correlated better
with the archeological evidence and the demographic processes of
Neolithic agricultural expansions into the region.


Materials and Methods 


Sampling Strategy 

Tamil Nadu, the land of Tamils (Tamil has the most ancient
literary tradition of all Dravidian languages), is the southeasternmost
province of India, measuring 130,058 km² with a population
of 62,405,679 (2001 Indian Census: http://www.censusindia.gov.in),
the majority living in 17,272 villages. We sampled a total of
1,680 men, avoiding relatives to the third degree, from 12 tribal 
and 19 non-tribal endogamous populations, which were selected 
for their cultural uniqueness, geographical spread, and ethno- 
graphic features. Samples from tribal participants were collected in 
their isolated native villages and settlements from the tropical 
forests of the Western Ghats on the west side of TN. In contrast, non-
tribal populations have larger census sizes and a wider geographical
spread, and they were sampled in colleges and community
gatherings, covering 8% of the total villages in TN (see
Figure 1 for sampling locations). The institutional Ethical
Committees of Madurai Kamaraj University and the University 


Figure 1. Tamil Nadu map showing the sampling location of 
the 12 tribal (squares) and 19 non-tribal (circles) populations. 
The majority of tribal populations are located in the mountains of the 
Western Ghats. The color codes are: Red - Hill Tribe Foragers (HTF); 
Turquoise — Hill Tribe Cremating (HTC); Green - Hill Tribe Kannada 
(HTK); Grey - Scheduled Castes (SC); Pink - Dry-Land Farmers (DLF); Deep
Blue — Artisan and Warriors (AW) and Yellow —- Brahmin related (BRH). 
Population abbreviations are as shown in Table 1. 
doi:10.1371/journal.pone.0050269.g001 
of Pennsylvania (USA) approved the protocol and ethical
clearance of the study. The project was explained to the volunteers 
through local contacts or community leaders in their local 
languages and signed informed consent was obtained before 
samples were collected. Permission to utilize pre-existing samples 
from Nilgiri tribes (N=570) was obtained from the relevant 
institution (Nilgiris Adivasi Welfare Association). Further geno-
typing of 17 Y-STRs and deeper Y-SNPs was performed on 46
samples of Piramalai Kallar, 40 samples of Sourashtra and 107
samples of Yadhava used in a previous study [19].

While many previous Indian population studies aimed to 
elucidate the main processes involved in the genesis of the social 
stratification by pooling populations into broad classifications such 
as caste-tribe dichotomy and social hierarchy [6,13,45,46], we 
sought to explore whether alternative classifications could better 
reflect the relationships among the true endogamous groups by 
increasing between-population differences and reducing within- 
population variation [24]. We considered a partition of the 31 
endogamous populations into seven Major Population Groups 
(MPG) based on socio-cultural factors primarily reflecting
subsistence, traditional customs and native language
[47,48,49,50], which we contrasted with alternative groupings.
The defining features for these MPGs were the following: (1) ‘Hill
Tribe - Foragers’ (HTF), tribal populations sharing a foraging
mode of subsistence and speaking their own Dravidian (Tamil/
Malayalam) dialects; (2) ‘Hill Tribes - Cremating’ (HTC), tribes
who cremate their dead, a unique socio-cultural feature among
these tribal populations; (3) ‘Hill Tribes - Kannada-Speakers’
(HTK), hunter-gatherer tribes speaking the Kannada (Dravidian)
languages; (4) ‘Scheduled Castes’ (SC), designated by the Indian
Government as non-land owning laborers, ranked lowest in the
Varna system; (5) ‘Dry Land Farmers’ (DLF), populations living by
dry-land farming subsistence, cultivating crops (millets and grains)
that do not require irrigation technology; (6) ‘Artisans and
Warriors’ (AW), populations that are traditionally warriors or
artisans of various kinds; and (7) ‘Brahmin Related’ (BRH),
populations following the Vedic traditions with a good knowledge of water
management and wet land irrigation. The populations included in
each of the seven MPG and their ethnographic notes are given in 
Table 1. Although it may appear that the proxies used for 
grouping the populations mix criteria in non-uniform and 
arbitrary ways, we followed a systematic, step-by-step approach 
to test and validate these classifications by comparing them with 
other groupings employed in the literature. Endogamous popula- 
tions were initially sampled taking caste-tribe and social hierarchy 
into consideration. After considering their ethnographic histories 
in greater detail, we tested whether tribes with common cultural 
features tended to share a similar genetic makeup, and whether 
population groups differentiated better when clustered according 
to socio-cultural factors reflecting their mode of subsistence,
traditional customs, and native language. It is important to stress
that many of the criteria used in the classification based on the
seven MPG are to some degree correlated with previous methods
employed to classify Indian populations (such as tribe-caste
dichotomy, or caste-rank hierarchy). It could be argued that the
seven MPG method may not be the best possible arrangement
from the perspective of explaining the entire cultural variation in
TN. However, it captures the observed pattern of genetic variation
slightly better than any of the previously attempted models (see
Results Section). Finally, we recognize that there is always a
degree of arbitrariness in all the methods used to classify endogamous
populations, but all of them are just subtle variations around the
same theme: economy or mode of subsistence.
Y-Chromosomal Analysis

DNAs were extracted from blood or mouth-wash samples using
standard methods [19]. Samples were genotyped for single
nucleotide polymorphisms (SNPs) with a set of 23 custom TaqMan
assays (Applied Biosystems) using a 7900HT Fast Real-Time PCR
System. In addition, 19 Y-chromosomal short tandem repeat
(STR) and 6 SNP loci (Y-filer™ and Multiplex II Kits, ABI) were
genotyped using an ABI 3130XL Gene Analyzer, and fragment
sizes were determined using the GeneMapper Analysis Software
(v3.2, ABI) as described elsewhere [51]. Genotypes were validated
by testing reference samples from Coriell and the Genographic
Consortium. The multi-copy markers DYS385a and DYS385b
were excluded from further analyses because of ambiguity in
distinguishing these loci. Y chromosome haplogroups (HGs) and
paragroups were determined according to the 2008 YCC
nomenclature [52].


Statistical Analysis

The software ARLEQUIN 3.11 [53] was employed to compute
Nei’s D (Nei 1987) and conduct AMOVA [54] using both Y-
chromosome HG frequencies and haplotype data. Fisher exact 
tests were carried out among populations and MPGs to identify 
significantly over- or under-represented HGs. Among those over- 
represented HGs that tended to characterize any given MPG, 
Fisher exact tests were further performed on the number of 
populations over-represented in the given HG within the MPG 
versus those outside of the MPG to quantify the significance of 
such associations. Principal Component Analysis (PCA) [55] was 
performed using HG frequencies, centered without variance 
normalization [56] and with the significant components identified 
by employing the scree-plot method [57] using R, version 2.9.1
(http://www.r-project.org/). The same software was used
to perform non-metric multidimensional scaling (MDS) [58] using
RST distances generated from the 17 Y-STR data of the TN
populations, using ARLEQUIN. The relative HG age estimates 
were based on the variance of 17 STRs of the most frequent HGs 
for the seven MPG as previously described [51]. 
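
A variance-based relative age estimate of the kind mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the stepwise-mutation expectation that the mean per-locus variance in repeat number grows as the mutation rate times the number of generations. The function name and the default mutation rate and generation time are hypothetical placeholders, not values taken from this study, whose actual procedure is the one described in [51].

```python
from statistics import mean, pvariance


def str_variance_age(haplotypes, mutation_rate=6.9e-4, generation_years=25):
    """Rough age (in years) from the mean per-locus variance of STR repeats.

    haplotypes: list of equal-length tuples of repeat counts, one tuple per
        chromosome belonging to the same haplogroup.
    mutation_rate: assumed per-locus, per-generation rate (placeholder).
    generation_years: assumed generation time in years (placeholder).
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n_loci = len(haplotypes[0])
    if any(len(h) != n_loci for h in haplotypes):
        raise ValueError("all haplotypes must cover the same loci")
    # Variance of the repeat count at each locus across chromosomes.
    per_locus = [pvariance([h[i] for h in haplotypes]) for i in range(n_loci)]
    # Under a stepwise mutation model, variance ~ mutation_rate * generations.
    generations = mean(per_locus) / mutation_rate
    return generations * generation_years
```

Because the result scales inversely with the assumed mutation rate, such estimates are best read as relative ages for comparing haplogroups or groups rather than as absolute dates.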

We considered the problem of how to quantify the significance 
of the difference between specific population group structures. 
AMOVA’s resampling scheme compares individual group struc- 
tures to the whole ensemble of randomly varied assignments of 
populations to groups, as well as of samples to populations. This 
tests the hypothesis that a specific group structure represents 
organization of the genetics among populations better than would 
be expected by chance. In our case, we had the different problem
of testing whether one group structure was significantly better than
another group structure. Here the assignments were already
determined, and both are likely already better than expected by
chance. The question we tested was whether sampling variation
alone, that is, variation in data randomly drawn from a population,
could have produced enough variation in the AMOVA results to
account for the differences between the specific group assignments
being compared. Hence we resampled the STR haplotypes with replace-
ment, modeled by a multinomial distribution, and computed the
median and 95% CIs of the results using R, version 2.9.1. We
tested resampling sizes up to 5,000 times, and found that 500 were 
sufficient to give reasonable accuracy on the median and 
confidence interval estimates. We therefore resampled each 
configuration only 500 times. 
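
The resampling step can be sketched as below. This is a minimal Python illustration, not the R code used in the study: drawing n haplotypes with replacement is equivalent to a multinomial draw of haplotype counts, and the statistic passed in stands for whatever AMOVA-derived quantity is being compared. The names `bootstrap_ci` and `distinct_fraction` are hypothetical, and `distinct_fraction` is only a toy stand-in statistic.

```python
import random
from statistics import median


def bootstrap_ci(haplotypes, statistic, n_resamples=500, seed=0):
    """Median and 95% percentile interval of a statistic under resampling."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    rng = random.Random(seed)
    n = len(haplotypes)
    # Each resample draws n haplotypes with replacement from the observed set.
    values = sorted(
        statistic(rng.choices(haplotypes, k=n)) for _ in range(n_resamples)
    )
    lo = values[int(0.025 * (n_resamples - 1))]
    hi = values[int(0.975 * (n_resamples - 1))]
    return median(values), (lo, hi)


def distinct_fraction(sample):
    """Toy statistic: fraction of distinct haplotypes in the sample."""
    return len(set(sample)) / len(sample)
```

Two group structures would then be judged different when the intervals obtained for their respective statistics do not overlap; increasing `n_resamples` narrows the error on the interval endpoints at proportional cost.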

The phylogenetic relationships among Y-STR haplotypes
drawn from individual haplogroups were estimated with the 
reduced-median (RM) network algorithm in the program Network 
4.5.0 [59,60], applying weights inverse to averaged haplotype 
variance and reduced median reduction coefficient set at 1.0. This 


Table 1. Cont.

[Table body not recoverable from the extraction. Surviving column headings: Major Group; Population Name; Code; Native Language; Linguistic Family; Social Rank; Mode of Subsistence. Surviving row: Vadama; VDM; Sanskrit; IE; High; Wet Land Agriculture/Priests.]

Notes: Sanskrit is the language of scriptures and ceremonies, but populations quickly adopted local cultures and languages. Lower, Middle and Higher social ranks are self-perceived/assigned classifications. Census sources: 2001 Census, Government of India, http://www.censusindia.gov.in; 1981 Indian Census; 1931 Indian Census; 1901 Indian Census; estimated census size. Approximate coordinates. Population code used in PCA and MDS plots. All Brahmin-related castes in Tamil Nadu. No information available.
NTN (North Tamil Nadu), TNV (Tirunelveli).
DR (Dravidian), IE (Indo-European).
doi:10.1371/journal.pone.0050269.t001

program creates a tree topology based on the interrelationships of
the emergence and transmission of mutations in the respective 
haplotypes. Even under simplifying conditions, the construction of 
this simple combinatorial structure is algorithmically difficult, and 
diverse algorithms give different answers. This result can be 
informative if some subset of the results is consistent among 
models. Therefore, in addition to using Network for assessing the
phylogenetic relationships of Y-STR haplotypes, we also used
ULTRANET (http://www.dei-unipd.it/~ciompin/main/Sito/Ultranet.html),
where the underlying distance (metric) between
nodes is ultrametric. Since tree structures reflect an ultrametric
structure, an algorithm that maps the compatibility of associations
according to such a structure may be uniquely informative. This
approach, which is orthogonal to other phylogenetic approaches, 
helped confirm the results observed in RM network analysis, 
thereby validating the consistency of the population associations 
with evolutionarily related haplotypes. 

Coalescence methods, as implemented in BATWING [61], 
were applied to several different subsets of populations to quantify 
major underlying demographic events, estimate divergence times 
and assess the phylogenetic relationships among TN populations.
One of the major characteristics of BATWING is that the trees it
produces are constructed on the assumption of no gene flow 
among demes. The proportions of samples the Metropolis- 
Hastings algorithm provides in each tree gives some sense of the 
strength of that candidate tree in representing the data. These 
estimates account for the impact of mutation histories through the 
likelihood scores obtained over the distributions of priors for 
mutation rates and other demographic parameters. The outcome 
of these estimates is that modal, and near modal, trees will show a 
somewhat filtered view of the genetics contributing to the most 
likely trees observed. Given these considerations, BATWING is
expected a priori to be appropriate for testing whether major 
population differentiation occurred before or after the Varna 
system was historically established in TN, under the assumption of 
restricted admixture among populations under this social organi- 
zation and structured endogamous system. The various testing 
procedures described above, including MDS, PCA, the AMOVA 
tests for differentiation, and the Fisher tests, were further applied 
to establish whether there was a signal for common gene pools 
among populations, as required for typical BATWING analyses. 

In addition, BATWING admixture validation tests [62] of the 
TN data were applied under three simulated potential scenarios. 
In the first scenario, an individual population (Paniya) was
randomly split, and the BATWING analysis of the population 
split time was performed. BATWING generally produced a 
median time of less than 500 years, with the 95% confidence 
intervals (CI) covering only the last two generations. In the second 
scenario, recent gene flow was modeled between two populations 
(Paniya and Brahacharanam) estimated by BATWING to have 
already been isolated for a significant time (19.5 Kya) by randomly 
mixing different proportions of chromosomes from each popula- 
tion. BATWING gave much younger population divergence 
estimates (9.3 Kya) than the unmixed split, even with only 5% of
the Y-chromosomes mixed randomly between the two popula-
tions, and a 10% mix between populations decreased the
divergence time estimates by more than 50% (3 Kya). In the
third scenario, we explored the impact on BATWING estimates of
randomly introducing an in-migrating population (Paniya) carry-
ing new paternal lineages into two differentiated demes (Braha-
charanam and Kota: split time was estimated at 4.7 Kya). These
estimates were only slightly affected (the split time actually
appeared to increase to 6.2 Kya) as long as the in-migrating propor-
tion did not exceed 40-50%. At that point, the modal
trees were dominated by the in-migrating population. Overall, the
results of the BATWING admixture tests based on data from the
TN populations were similar to those observed in a study of
religious populations within Lebanon [62]. Therefore, BATWING
generally seems to show little sensitivity to gene flow from 
immigrants bringing new paternal lineages (different HGs) into the 
parent population, but is very sensitive to gene flow between 
populations sharing paternal lineages from the same HGs. 
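
The mixing step of the second scenario can be illustrated with a short Python sketch. The text does not specify exactly how chromosomes were exchanged, so the symmetric swap below, the function name `mix_populations`, and the choice to scale the proportion by the smaller sample are all assumptions made for illustration; the mixed samples would then be passed to BATWING in place of the originals.

```python
import random


def mix_populations(pop_a, pop_b, proportion, seed=0):
    """Swap a random fraction of chromosomes between two samples.

    pop_a, pop_b: lists of haplotypes (any hashable representation).
    proportion: fraction of the smaller sample to exchange (0 to 1).
    Returns new lists; sample sizes are preserved.
    """
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must lie between 0 and 1")
    rng = random.Random(seed)
    k = round(proportion * min(len(pop_a), len(pop_b)))
    # Pick k distinct chromosomes in each sample to trade places.
    idx_a = rng.sample(range(len(pop_a)), k)
    idx_b = rng.sample(range(len(pop_b)), k)
    new_a, new_b = list(pop_a), list(pop_b)
    for i, j in zip(idx_a, idx_b):
        new_a[i], new_b[j] = pop_b[j], pop_a[i]
    return new_a, new_b
```

A symmetric swap keeps both sample sizes fixed, so any change in the estimated split time reflects the exchanged lineages rather than a change in sample size.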

Besides assuming no gene flow, BATWING presupposes that 
the population samples are random. As a result, using BATWING
to analyze the histories of individual HGs drawn from populations
yields dramatically different estimates of coalescence times, times
of expansion, and other population parameters because, as
mentioned in the admixture modeling, BATWING is more
sensitive to admixture than in-migration. Thus, BATWING may
be applied to individual HGs to extract information about specific 
in-migration events. Further, HGs that tend to correlate strongly 
with overall population estimates are likely to be more represen-
tative of their common ancestral gene pool. These results may be
expected in that selection of the modal population trees will tend 
to preserve configurations where the most common of the shared 
lineages comprise the strongest signals contributing to the 
likelihood function. Therefore, selection of modal trees acts as a 
filter that tends to exclude immigrating contributions, although it 
will be heavily influenced by inter-population migration. 

In these BATWING estimates, mutation rate priors were those 
previously proposed [63] based on the effective mutation rates 
previously cited [64]. Between 1.5 and 3.5 million Monte Carlo 
(MC) samples were collected, generally accepting equilibration 
following 500,000 MC samples and being determined by decay to 
equilibrium of global estimates of effective population size and 
relative constancy of quantile measurements extracted from the 
equilibrated regions. Times associated with clusters identified by
RM networks as indicating evolution within populations were 
estimated using UEPtmin and UEPtmax estimates within BATW- 
ING. When computing population splits, large numbers of 
populations tend to produce cross-talk between bifurcations on 
different branches. A way to resolve this cross-talk is to set up 
multiple runs with the various branches pooled except for the 
primary branch under consideration. This approach also provides 
an opportunity to check the consistency of split times of the parent 
branches common to the pooled topologies. Composite trees may 
then be constructed from the results of the multiple runs. SNPs 
selected as unique evolutionary polymorphisms (UEPs) in compu- 
tations of population split times depended on the representation of 
variation through each of the populations being considered, or 
through the pooled populations for UEP time estimates. 


Results 


NRY landscape of Tamil Nadu reveals predominantly 
autochthonous lineages 

A total of 21 Y chromosome HGs were identified in the study 
populations (Table 2). The overall HG diversity among popula-
tions was 0.886±0.003; of these, tribal populations exhibited lower
diversity (0.796±0.013) than non-tribal populations
(0.881±0.004). The majority of this genetic variation (82%) was
accounted for by seven HGs: H1-M52 (17.4%), F*-M89 (16.3%),
L1-M27 (14.0%), R1a1-M17 (12.7%), J2-M172 (9.4%), R2-M124
(8.2%) and H-M69 (4.7%). It should be noted that 90% of the C- 
M130 samples reported here (66 out of 74) were positive for C5- 
M356 while the rest were negative for both C3-M217 and C5- 
M356 (Table S1). 
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BRH-Brahmins 
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0.00 0.00 0.00 0.00 25.00 0.00 0.00 0.00 


0.00 0.00 0.00 0.00 9.52 0.00 9.52 0.00 
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40 


Sourashtra 


19.05 33.33 4.76 0,848 (0.054 


0.00 0.00 9.09 0.00 0.00 0.00 0.00 0.00 0.00 36.36 0.00 0.818 (0.083 


0.00 0.00 000 4.76 0.00 0.00 4.76 0.00 
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Brahacharanam 


0.00 0.00 27.27 0.00 9.09 0.00 0.00 0.00 


0.00 
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11 


lyengar 


159 14.29 1.59 3.17 0.00 0.00 635 47.62 0.00 0.746 (0.052 


0.00 0.00 


1.59 4.76 0.00 7.94 0.00 3.17 0.00 
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Vadama 


1.48 0.74 0.00 5.93 42.22 2.22 0.779 (0.030) 6.67 


1.48 13.33 0.74 


0.00 0.00 


0.06 


13.33 0.00 2.96 0.00 


0.00 0.74 444 0.00 
16.25 3.10 4.70 


135 3.70 
44 


BRH Total 
1680 


0.886 (0.003 


12.74 8.21 


13.99 1.13 083 036 1.55 2.02 


1.19 0.77 


149 0.12 9.35 


17.38 0.06 


0.3 


31 populations TOTAL 


SD (Standard Deviation). 


doi:10.1371/journal.pone.0050269.t002 
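The HG diversity values quoted in this subsection can be computed from raw haplogroup counts. The sketch below assumes that the statistic is Nei's unbiased gene diversity, h = n/(n-1) * (1 - sum of squared frequencies); the estimator is not named in this section, so this is an assumption, and the counts used are hypothetical rather than taken from Table 2.

```python
def haplogroup_diversity(counts):
    """Nei's unbiased gene diversity from a list of haplogroup counts.

    Assumed estimator: h = n / (n - 1) * (1 - sum(p_i ** 2)).
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    sum_sq = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical population: 20 chromosomes spread over four haplogroups.
example_counts = [10, 5, 3, 2]
example_h = haplogroup_diversity(example_counts)
print(f"h = {example_h:.3f}")
```

A population dominated by one haplogroup gives a value near 0, while an even spread over many haplogroups approaches 1, which is the sense in which the tribal groups are described as less diverse.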


Genetic Structure of Southern Indian Populations 


The geographical origins of many of these HGs are still debated.
However, the high frequencies and haplotype variances
of HGs H-M69, F*-M89, R1a1-M17, L1-M27, R2-M124 and
C5-M356 within India have been interpreted as evidence of an
autochthonous origin of these lineages during the late Pleistocene
(10-30 Kya), while the lower frequencies within the subcontinent of
J2-M172, E-M96, G-M201 and L3-M357 are viewed as reflecting
probable gene flow introduced by West Eurasian Holocene
migrations in the last 10 Kya [6,7,16,23]. Assuming these
geographical origins of the HGs to be the most likely ones, the
putatively autochthonous lineages accounted for 81.4±0.95% of
the total genetic composition of TN populations in the present
study. These results are concordant with earlier studies based on
autosomal markers and haploid loci in suggesting lower gene flow
from West and Central Asia to south India than to north
India [5,11,23,65]. Additionally, our results indicate a potentially
differential genetic impact of these migrations on tribal versus
non-tribal groups. For example, the proportion of non-autochthonous
Indian lineages was significantly higher (p<0.0001)
among non-tribal populations (13.7±1.03%) than among the
tribal populations (7.4±1.09%). In contrast, the proportion of
likely autochthonous lineages among the tribal populations
(87.7±1.37%) was significantly higher (Fisher test: p<0.0001) than
in non-tribal populations (78.1±1.24%).
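The two kinds of numbers in this paragraph, a proportion with its ± term and a Fisher test p-value, can be illustrated with a short sketch. It assumes that the ± values are binomial standard errors and that the total sample is 1,680 chromosomes; both are assumptions, not statements from this section. The 2x2 counts passed to the Fisher test are hypothetical and are not the study's data, so the resulting p-value is illustrative only.

```python
from math import comb, sqrt


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n observations."""
    return sqrt(p * (1.0 - p) / n)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the tables that are no more probable than the observed one.
    return sum(p for p in probs.values() if p <= p_obs * (1.0 + 1e-9))


# Assumption: 1,680 chromosomes in total; 0.814 is the reported proportion.
se_overall = binomial_se(0.814, 1680)

# Hypothetical counts (NOT the study's data):
# (non-autochthonous lineages, all other lineages) per group.
tribal = (40, 460)
non_tribal = (140, 860)
p_value = fisher_exact_two_sided(*tribal, *non_tribal)

print(f"SE = {100 * se_overall:.2f}%")
print(f"Fisher exact p = {p_value:.2g}")
```

Under these assumptions the standard error of the overall autochthonous proportion comes out close to the ±0.95% quoted in the text, which is consistent with a binomial reading of the ± terms.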


Genetic structure of Tamil Nadu populations is best 
correlated with subsistence practices 

AMOVA using both HGs and STR distances (Rst) was applied
to several different models of population differentiation to assess
the proportion of genetic variation explained by geography,
tribe-caste dichotomy, caste-rank hierarchy, and other socio-cultural
factors reflecting subsistence practices (Table 3, Table S2). The
highest genetic variation among classifications involving all
populations (Fct = 0.065; among resampled data, median = 0.064,
95% CI: 0.052-0.078) and the lowest variation within groups
(Fsc = 0.040; median = 0.062; 0.05-0.074) were observed when
populations were classified into the seven MPGs based on
subsistence. Further analyses considering only the four non-tribal
groups revealed a four-fold decrease in genetic variation among
groups (Fct = 0.015; median = 0.014; 0.003-0.026) when compared
to the three tribal groups alone (Fct = 0.095; median = 0.095;
0.066-0.129). Moreover, the exclusion of HTF
reduced the between-group variance by more than two-fold
(6.5% to 2.7%), while exclusion of HTK and BRH had little
impact. On the other hand, the exclusion of BRH from non-tribal
groups reduced the between-group variation threefold (1.5% to
0.4%).
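The three fixation indices reported by AMOVA are linked through the variance components. In the standard formulation, with Va the among-group variance, and with Vb (among populations within groups) and Vc (within populations) as labels introduced here for illustration, Fct = Va/(Va+Vb+Vc), Fsc = Vb/(Vb+Vc) and Fst = (Va+Vb)/(Va+Vb+Vc), so that (1 - Fst) = (1 - Fsc)(1 - Fct). The sketch below checks this identity against three rows of STR values from Table 3; small differences are expected because the published values are rounded to three decimals.

```python
def fst_from_hierarchy(fct, fsc):
    """Total fixation index implied by the among-group (Fct) and
    within-group (Fsc) indices: 1 - Fst = (1 - Fsc) * (1 - Fct)."""
    return 1.0 - (1.0 - fsc) * (1.0 - fct)


# (Fct, Fsc, reported Fst) taken from the STR columns of Table 3.
str_rows = {
    "Geography": (0.035, 0.063, 0.096),
    "7 MPG": (0.065, 0.040, 0.102),
    "HTF-HTK-HTC": (0.095, 0.079, 0.167),
}

implied = {name: fst_from_hierarchy(fct, fsc)
           for name, (fct, fsc, _) in str_rows.items()}

for name, (fct, fsc, reported) in str_rows.items():
    print(f"{name}: implied Fst = {implied[name]:.3f}, reported = {reported:.3f}")
```

The identity also makes the trade-off in the text explicit: for a fixed total Fst, a grouping that raises Fct must lower Fsc, which is why the 7-MPG classification shows both the highest among-group and the lowest within-group variation.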

To determine if the number of groups taken into consideration
had a significant impact on the Fct values obtained, we compared
the mean and 95% CI of the null distribution of Va (among-group
variance, data not presented) that is used to estimate the Fct
index. It is logical that the Va null distribution would vary with
different groupings if the relative impact of groups is high.
Contrary to this, we found that the mean and the standard
deviations of the null distribution did not vary much among
groupings (Table 3), suggesting that the number of groups
taken into consideration did not have much impact on the Fct
estimates. Further, the 95% CIs of the AMOVA estimates
computed by re-sampling 500 haplotypes with replacement across
populations showed that the 95% CI of the 7-MPG classification was
significantly higher than that of grouping by geography or Varna
rank status (Table S2).

The PCA and MDS analyses of HG frequencies and Rst
distances reflected the AMOVA results (Figures 2a, 2b). In the
PCA analysis the first two components accounted for 38.86% of the
variance, while in the MDS analysis a stress value of 15.6% was
obtained when the objects were clustered in two dimensions. This
stress value is significant in the light of the work of Sturrock and
Rocha, 2000 [66]. In both plots, two tribal (HTF, HTK) and the
non-tribal Brahmin (BRH) groups formed distinct and distant
clusters, while the rest were interspersed in their midst.

Table 3. Analysis of molecular variance (AMOVA).

Fct: among groups; Fsc: among populations within groups; Fst: within populations.

Populations grouping              Groups  Fct SNPs   Fct STRs   Fsc SNPs  Fsc STRs  Fst SNPs  Fst STRs
All 31 populations                1       -          -          -         -         0.103     0.093
Geography                         9       0.025[?]   0.035[?]   0.083     0.063     0.106     0.096
Socio-Cultural Factors
7 Major Population Groups (MPG)   7       0.082[?]   0.065[?]   0.036     0.040     0.114     0.102
HTF excluded                      6       0.035[?]   0.026[?]   0.027     0.034     0.061     0.060
BRH excluded                      6       0.077[?]   0.059[?]   0.037     0.042     0.111     0.099
HTK excluded                      6       0.082[?]   0.062[?]   0.031     0.039     0.111     0.099
Caste vs Tribe                    2       0.075[?]   0.062[?]   0.069     0.065     0.139     0.124
TR-UP-MID-LOW                     4       0.057[?]   0.047[?]   0.065     0.063     0.119     0.107
Tribes Only
HTF-HTK-HTC                       3       0.110[b]   0.095[?]   0.081     0.079     0.182     0.167
Non-tribes (Castes) Only
UP-MID-LOW                        3       0.019[?]   0.015[?]   0.024     0.030     0.042     0.044
SC-DLF-AW-BRH                     4       0.023[?]   0.015[?]   0.017     0.026     0.039     0.041
SC-DLF-AW                         3       0.009[b]   0.004[d]   0.016     0.027     0.025     0.031

[a] P<0.00001.
[b] P<0.001.
[c] P<0.01.
[d] Not significant, P<0.2.

TR (Tribes), HTF (Hill Tribe Foragers), BRH (Brahmins), HTK (Hill Tribe Kannada speakers), SC (Scheduled Castes), DLF (Dry Land Farmers), AW (Artisan & Warriors).

HG, MID, LOW - High, Middle and Low caste-rank hierarchy as described in Table 1.

Endogamous populations were grouped based on geography, tribe-caste dichotomy, caste-rank hierarchy, and socio-cultural features mainly reflecting subsistence (7 Major Population Groups, MPG). The maximal genetic variation among groups (Fct) and the minimal variation among populations within groups (Fsc) were observed when populations were grouped based on the 7 MPG classification.

doi:10.1371/journal.pone.0050269.t003

Interestingly, the same tribal groups showed greater genetic
similarities to other Dravidian tribes from the southern states of
Andhra Pradesh and Orissa, and TN BRH clustered with
IE-speaking populations from multiple regions, when the present data
set was compared with 97 populations from India and neighboring
regions by PCA (Figure S1, Table S3). The historical migrations of
BRH into TN and the long-term isolation of some Dravidian
tribal groups already reported in previous studies [15,17,25] could
potentially explain why HTF, HTK and BRH groups exhibited
greater genetic similarities with those culturally related populations
outside of TN. Taken together, the PCA, MDS and AMOVA
results all indicate strong genetic structure among TN populations.
They further suggest that the MPG classification based on
socio-cultural factors reflecting subsistence better reproduces true
endogamous groups by increasing between-population differences
and reducing within-population variation.


Non-homogenous HG distributions among constituent 
populations of MPGs 


Fisher exact tests indicated that various HGs were significantly
predominant in one or another MPG (Table S4). The highest
frequency of F-M89 (53.3%) was observed among HTF
(p<0.0001), while H1-M52 showed the highest frequency
(42.5%) in HTK (p<0.0001). Among the non-tribal groups,
BRH showed 42.2% of R1a1-M17 (p<0.0001), and L1-M27
appeared at a higher frequency (24.1%; p<0.0001) among DLF.
However, wide variation in HG frequency and composition was
observed among the populations included in each of these MPGs
(Table 2). For example, the proportion of F*-M89 in HTF ranged
from 75% to 28.6% among the constituent populations. A similar
pattern was observed in other MPGs characterized by H1-M52 in
HTK and L1-M27 in DLF. Thus, not all the constituent
endogamous populations in an MPG shared a similar genetic
makeup, indicating the differential influence of evolutionary forces
such as drift, fragmentation, long-term isolation or admixture.

In addition, Fisher exact tests were used to determine the
probability of observing multiple populations within an MPG
sharing the same over- or under-represented HGs by chance (e.g.,
random demic assimilation into an MPG from already differentiated
endogamous populations) or because of the systemic
inheritance of ancestral lineages among the constituent populations
of MPGs. Our results rejected the hypothesis that random
processes could have caused the significant over-representation of
F*-M89 in HTF+HTK populations (p<0.0001), L1-M27 in DLF
populations (p<0.001), H1-M52 in HTK populations (p<0.0001),
and R1a1-M17 in BRH populations (p=0.001). Likewise,
significant results were obtained for under-representation of
F*-M89 in all BRH populations (p=0.043), L1-M27 in HTF
populations (p=0.02) and R1a1-M17 in HTF populations
(p=0.003). Together, these results argue for the distinctiveness
of the ancestral gene pools for MPGs and the shared heritage of
these paternal lineages among populations within MPGs, in spite
of their non-homogenous distribution. Further, the over-represented
HGs marking MPGs explain in part some of the
organization observed in the PCA and MDS results, and also
yield insight into the differentiations noted in the AMOVA
results.

Figure 2. Plots representing the genetic relationships among
the 31 tribal and non-tribal populations of Tamil Nadu. (A) PCA
plot based on HG frequencies. The two dimensions display 36% of the
total variance. The contribution of the first four HGs is superimposed as
grey component loading vectors: the HTF populations clustered in the
direction of the F-M89 vector, HTK in the H1-M52 vector, BRH in the
R1a1-M17 vector, while the HG L1-M27 is less significant in
discriminating populations. (B) MDS plot based on 17 microsatellite
loci Rst distances. The two tribal groups (HTF and HTK) are clustered at
the left side of the plot while BRH form a distant cluster at the opposite
side. The colors and symbols are the same as shown in Figure 1, while
population abbreviations are as shown in Table 1.
doi:10.1371/journal.pone.0050269.g002


Figure 3. Reduced median network of 17 microsatellite
haplotypes within haplogroup F-M89. The network depicts clear
isolated evolution among HTF populations with a few shared
haplotypes between Kurumba (HTK) and Irula (HTF) populations. Circles
are colored based on the 7 Major Population Groups as shown in
Figure 1, and the area is proportional to the frequency of the sampled
haplotypes. Branch lengths between circles are proportional to the
number of mutations separating haplotypes.
doi:10.1371/journal.pone.0050269.g003


Reduced median network analysis identifies strong
founder effects among tribal populations

RM networks were constructed to evaluate HG diversification 
within TN populations. Here, low-reticulated networks with 
branches showing segregation by population were expected if 
strong founder effects had shaped variation in paternal lineages, 
particularly in the HGs overrepresented in MPGs. By contrast, 
reticulated networks exhibiting shared STR haplotypes between 
populations from different MPGs would indicate that contempo- 
rary populations were derived from descendants drawn from 
differing sources carrying disparate and diverse STR haplotypes, 
suggesting potential admixture among populations. Long branches 
with multiple unoccupied steps (internodes) connecting constituent 
haplotypes would suggest strong genetic drift or possibly sporadic 
intrusion from a genetically distinct source. 

F*-M89 was the only HG showing clear population-specific 
clusters (Paniya, Paliyan and Irula of HTF) suggesting long-term 
isolation (Figure 3). In contrast, all other RM networks did not 
show any population-specific clusters and were reticulated with 
long branches having multiple internodes (Figure S2a to S2e). 
Overall, these results suggest that both genetic drift (possibly due to 
founder effects) and admixture may be a common feature of the 
studied populations. The combination of low segregation among 
RM networks and higher diversity may result from a period of 
assimilation of diverse sources into a larger common gene pool 
from which the modern populations were subsequently drawn. 


HG age estimates are older in non-tribal groups 

Tribes are generally considered to be the descendants of the early
settlers of India and, therefore, to better depict the autochthonous
genetic composition of India than non-tribal populations
[2,12,15,67]. An association between high frequency and high STR
variance of an HG in a population is a potential indicator of
long-term in-situ diversification, and may also indicate the likely
source of the HG in other populations. We therefore investigated
whether tribal populations possess older genetic lineages, and
could thus be the potential sources of these lineages for other
populations, by computing HG age estimates based on Y-STR
variances (Table 4). The age estimates for all HGs exceeded
10-15 Kya with overlapping confidence intervals among MPGs.
Further, MPGs exhibiting high frequencies of specific HGs did not
show the oldest age estimates. Interestingly, non-tribal groups
exhibited older age estimates than tribal groups for all HGs
except R2-M124. These results indicated that tribal and non-tribal
populations share a genetic heritage dating back to at least
the late Pleistocene (10-30 Kya). The HG age estimates presented
here are similar to those generated for the same HGs in earlier
studies involving a similar or smaller number of samples taken from
a broader geographic region of India [7,23].

Figure 4. Modal tree obtained by BATWING indicating the
coalescence time divergence estimates (in years) among Major
Population Groups (MPG) after using 17 STRs from all
haplogroups. BATWING estimates suggest that all population groups
started to diverge 7.1 Kya (95% CI: 5.5-9.2 Kya), with limited admixture
among them for the last 3.0 Kya (2.3-4.3 Kya), the youngest divergence
time estimate. The modal tree shows two differentiated nodes with
clearly overlapping estimates of the splits: a first node including one of
the tribal groups (HTC) together with all the non-tribal MPGs (castes)
with a divergence time of 6.2 Kya (4.7-8.0 Kya), while the second node
embraces the HTF and HTK tribal groups with an estimated divergence
between them of 4.9 Kya (3.6-7.1 Kya).

doi:10.1371/journal.pone.0050269.g004
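The idea behind an age estimate based on Y-STR variance can be sketched as follows. This section does not restate the estimator or the mutation rate that were used, so everything here is an assumption: the sketch divides the mean per-locus variance in repeat number by an effective mutation rate, taken for illustration as 6.9e-4 per locus per 25 years, a widely used evolutionary rate that may differ from the rate cited in [64]. The haplotypes are hypothetical.

```python
from statistics import mean, variance


def str_variance_age(haplotypes, rate_per_locus=6.9e-4, years_per_step=25):
    """Rough haplogroup age (years) from Y-STR repeat-count variance.

    haplotypes: list of equal-length tuples of repeat counts, one per locus.
    rate_per_locus: assumed effective mutation rate per locus per time step.
    years_per_step: assumed length of one time step in years.
    """
    per_locus = zip(*haplotypes)  # regroup repeat counts locus by locus
    mean_var = mean(variance(locus) for locus in per_locus)
    return mean_var / rate_per_locus * years_per_step


# Hypothetical data: four chromosomes typed at two loci.
example_haplotypes = [(14, 10), (15, 10), (14, 10), (15, 12)]
example_age = str_variance_age(example_haplotypes)
print(f"estimated age: {example_age / 1000:.1f} Kya")
```

Because the estimate scales directly with variance, a group that has pooled lineages from several sources can show an older apparent age than an isolated group carrying the same haplogroup at high frequency, which is one way to read the older estimates in non-tribal groups.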


BATWING estimates of genetic affinity and ancestry 

We configured several BATWING runs using different subsets
of data to estimate the dates of population differentiation and
explore the different demographic processes and affinities among
the MPGs and their constituent populations. The first set of
BATWING runs analyzed haplotypes from all HGs among all of
the MPGs to investigate whether tribal and non-tribal MPGs have
an independent origin or instead descended from a common
ancestral gene pool. If tribal and non-tribal groups have
independent origins, then it would be expected that population
tree bifurcations marking the differentiation of these two groupings
would exhibit very old divergence time estimates and
non-overlapping confidence intervals (CIs). Figure 4 represents the
modal tree obtained for this BATWING run. It shows that
populations began to diverge around 7.1 Kya (95% CI:
5.5-9.2 Kya), and contains two differentiated nodes with clearly
overlapping estimates of the splits. The first node separated the
HTF and HTK tribal groups from the rest of the MPGs, with an
estimated divergence time of 4.9 Kya (3.6-7.1 Kya), while the
second included the other tribal group (HTC) and the non-tribal
MPGs, with a divergence time of 6.2 Kya (4.7-8.0 Kya). These
BATWING estimates suggest that all MPGs started to diverge
during the same span of time with very limited admixture among
them, at least for the last 3 Kya (2.3-4.3 Kya), the youngest time
estimate.
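The argument above rests on whether the intervals around the split times overlap. A minimal sketch of that check, using the intervals quoted in the text (in Kya), is given below; the function and variable names are illustrative and are not part of BATWING.

```python
def intervals_overlap(first, second):
    """Return True if two closed intervals (low, high) share any point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Intervals in Kya as quoted in the text.
split_htf_htk = (3.6, 7.1)      # around the 4.9 Kya estimate
split_htc_castes = (4.7, 8.0)   # around the 6.2 Kya estimate
first_divergence = (5.5, 9.2)   # around the 7.1 Kya estimate

nodes_overlap = intervals_overlap(split_htf_htk, split_htc_castes)
print("tribal and non-tribal nodes overlap:", nodes_overlap)
```

Overlapping intervals mean that the two nodes cannot be ordered in time with confidence, which is why the text treats the tribal and non-tribal splits as belonging to the same period rather than to independent origins.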

The second set of BATWING runs included only haplotypes 
from one of the most common HGs among MPGs. In this regard, 
we would like to emphasize that BATWING results using 
haplotypes from only one HG cannot be interpreted as population 
divergence times, but rather reflect the demographic histories of 
the specific paternal lineage among populations. Also, deviations 
from population estimates among the different runs could reflect 
in-migrations (gene flow) involving a particular HG rather than 
multiple paternal lineages obtained from assimilation from a 
common ancestral gene pool. For these reasons, we explored 
whether the paternal lineages for each HG originated from the 
MPG that exhibits the highest frequency of this HG as a way to 
identify sources and recipients of these Y-chromosomes. In 
addition, similar splitting patterns obtained for the different HG 
trees could be interpreted as demonstrating that the paternal 
lineages entered into the general gene pool from the same 
demographic event. BATWING constructed clear modal trees for
three HGs (F*-M89, L1-M27 and H1-M52) but not for the others
(R1a1-M17, H-M69, J2-M172 and R2-M124). The three modal
trees (Figure S3a-S3c) exhibited very diverse branching patterns,
with tribal and non-tribal MPGs being mixed randomly and
without the outgroups corresponding to the MPG with the highest
HG frequency, as would be expected if this MPG were the main
source of this paternal lineage for other populations. Estimates of
the time to the most recent common ancestor (TMRCA) for the HGs
ranged from 11.4 Kya for F*-M89 to 6.1 Kya for L1-M27. Similar dates
marking the founding of the clusters identified in the HG F*-M89 
network with Ultranet clustering were obtained by BATWING 
using virtual UEPs to define clusters. The similar TMRCA 
estimates and the diverse tree topologies suggest that extant tribal 
and non-tribal groups derive from the ancient populations of the 
region, with population differentiation taking place at relatively 
similar times under complex demographic histories with multiple 
entries and sources of the common paternal lineages. 

Finally, a third set of BATWING runs was performed using all
HGs from individual populations within selected MPGs to test
whether the grouping of these populations could have affected
BATWING estimates of population divergence and phylogenetic
relationships (Figure S4a-S4c). All endogamous populations
grouped according to their MPG classification in the BATWING 
trees with the exception of the HTF-Irula clustering with other 
HTK tribes. This result was not unexpected because the Irula and 
the Kurumba were seen to share STR haplotypes in the F*-M89 
and H*-M69 networks. BATWING estimated the differentiation 
between them to have occurred 3.4 Kya. In addition, BATWING 
assigned similar time frames to those in the previous two sets of 
runs, when major differentiation may have occurred among the 
endogamous populations, independently of the selected populations
used. Moreover, the two most recent split estimates obtained
by BATWING runs using endogamous DLF populations agree
with historical records, which indicate recent demographic
expansions for the Vanniyars (2.3 Kya) and Nadars (1 Kya).
These results further supported the classification of the seven 
MPGs, for which the population divergence time estimates were 
consistent for all sets of BATWING runs. 


Discussion 


The study populations from Tamil Nadu were characterized by
an overwhelming proportion of Y-chromosomal lineages that
likely originated within India, suggesting a low genetic influence
from western Eurasian migrations in the last 10 Kya. Although
non-tribal groups exhibited a slightly higher proportion of
non-autochthonous lineages than tribal populations, the common
paternal lineages shared by TN populations are likely drawn from
the same ancestral genetic pool that emerged in the late
Pleistocene and early Holocene. We also noted that the current
modes of subsistence have shaped the genetic structure of TN
groups, with non-tribal populations being more genetically
homogeneous than tribal populations, likely due to differential
levels of genetic isolation among them. Coalescence methods,
employed to identify specific and distinctive periods when genetic
differentiation among populations occurred, indicated a time scale
of ~6,000 years. We discuss below whether the timing of the male
genetic differentiation of the populations fits better with
archeological and historical records for the implementation of the
Hindu Varna system or with agricultural expansions in the TN
region.


Endogamous social stratification preexisted the Varna 
system 

Previous studies of Indian populations have grouped and
analyzed the genetic data in the light of the Hindu Varna system
[14,15,16], even though its origin and antiquity are still
debated. One of the theories that has acquired wide support
relates the establishment of the caste system to Indo-Aryan
expansions from Western Eurasia into India around 3 Kya. An
alternative view sees an earlier Indo-Aryan expansion with
an introduction of cereal farming into Pakistan/North India
around 8-7 Kya. The genetic evidence reported by other studies in
support of these theories is mainly based on the high frequency of
HG R1a1-M17 in Brahmin castes and their closer genetic affinity
with West Eurasian populations compared to other Indian
non-Brahmin castes and tribes [10,20]. However, admixture analyses
supporting a West Eurasian origin of the Brahmins may be biased
by the high frequencies of R1a1-M17 shared between these
populations, as the rest of their Y-chromosomal variation shows
little similarity [6,7,16]. Moreover, the recent discovery of new
markers within R1a1-M17 has allowed Eastern European
Y-lineages to be differentiated from those in Central/South Asia,
locating the oldest expansion times for this lineage in Indus
Valley populations and suggesting an earlier, possibly autochthonous
origin of this HG in South Asia [68]. The Brahmin populations in
the present study are also characterized by a significantly higher
frequency of R1a1-M17 relative to other TN groups, but without
any significant frequencies for HGs having a likely origin outside
India. The TN Brahmin populations also present a package of the
most common HGs very similar to that observed in 600 Brahmin
individuals from all over India [16]. We noted that the highest
STR variances for HG R1a1-M17, observed in SC and DLF,
along with the lack of population-specific clusters in the
R1a1-M17 network and the failure of BATWING to generate a
definitive modal tree for this HG, all argue against the
introduction of these paternal haplotypes through a single wave
of Brahmin (i.e. Indo-Aryan) migration into the region.

Literary works from the Sangam period (300 BCE to 300 CE)
describe a heterogeneous society that predates the incorporation of
already established populations into the Hindu Varna system [22]
in TN. Ancient Tamil society was highly structured by habitat and
occupation, and endogamy was practiced among populations
known as kudi [37]. Many of the populations, such as the Valayar
(meaning net weavers), Pulayar, Paliyan and Kadar, are cited in
the Sangam literature under the same names that are used
today. Thus, a structured society practicing endogamy existed
in TN prior to the inferred arrival of the Indo-Aryans in this
region. It is therefore most likely that the Varna system was
superimposed on the pre-existing and historically attested social
system without significant population transfer or input,
implementing a new social hierarchy and order during the
Pallava/Chola period from the 6th through 12th centuries CE [15,22].
However, the implementation of the Varna system may not have
been uniform across preexisting non-tribal populations, since many
of the populations within DLF and tribes neither practice
Vedic rituals nor have a very definite patrilineal system and clan
exogamy. Overall, our results suggest that the genetic impact of
Brahmin migrations into TN has been minimal and had no major
effect on the establishment of the genetic structure currently
detected in the region.


Models of agricultural expansions in the study region 
correlate with patterns of genetic diversity 

The present study shows that the MPG classification reflects the
genetic structure of the TN populations slightly better than other
models, and that both tribal and non-tribal populations possess
predominantly autochthonous lineages derived from a common 
gene pool established during the Late Pleistocene and Early 
Holocene. The distribution of over- and under-represented HGs 
suggests that populations within MPGs tend to share common 
genetic backgrounds. Using BATWING analysis, we estimate that 
social stratification for both tribal and non-tribal MPGs began 
between 6 Kya and 4 Kya, and detectable admixture between 
them has not occurred over the past 3 Kya, thereby allowing them 
to retain their genetic identity through cultural endogamy. 

Both the overall Y-chromosomal HG distribution and the
divergence estimates for tribal and non-tribal groups are
consistent with the archaeological dates and the demographic
processes involved in the expansion of agriculture in South Asia.
The South Deccan region near southern Karnataka and southwest
Andhra Pradesh contains the earliest evidence for an integrated
agro-pastoral system in South India, and likely acted as an
agricultural center and source of dispersion around 5 Kya
[30,31,34,69]. The genetic impact of the demographic processes
involved in the development and spread of agriculture in
India has been discussed within the Frontier theory framework
[30]. According to this model, agricultural groups rapidly
expanding into new environments suitable for farming created
moving frontiers where autochthonous lineages from multiple
pre-existing hunting and gathering forager populations were
assimilated into the new agriculturalist populations, thereby producing
centers of greater genetic diversity with less evidence of isolated
evolution than observed in foraging populations. This mechanism
was proposed by Semino et al. for the convergence of multiple E-M123
founders into Turkey prior to re-expansion into Europe in order to
explain the high diversity for that haplogroup [70]. The genetic
patterns observed in this study, such as the presence of the oldest
age estimates of autochthonous HGs among the agriculturalist
non-tribal populations (DLF), could reflect the assimilation of
paternal lineages from genetically diverse pre-existing populations
into common gene pools, and suggest that today's tribal
groups are not the sole source of these lineages.

In addition to this moving frontier, broader and more static 
agricultural frontier zones could also have arisen at later stages. In 
this area, stable and growing farming populations interacted with 
local foragers and created new cultural traditions, with some 
potential inter-marriage and assimilation through trade taking 
place. Southern Tamil Nadu and the Kerala zone represent one 
such agricultural frontier zone that has persisted to the present 
after local foragers began to adopt cultivation based on 
agricultural sedentism around 3 Kya [30]. Nowadays, TN tribes 
exhibit a wide variety of occupations and subsistence strategies,
and mostly inhabit the Western Ghats Mountains, which harbor
tropical and semi-tropical rain forests. In this context, two of the
three tribal groups associated with foraging lifestyles (HTF and
HTK) show the clearest signals of genetic drift, most likely due to
strong founder effects and long-term isolation. They exhibit the
lowest HG diversities (HTF: 0.687; HTK: 0.748), the highest
proportion of putative autochthonous lineages (HTF: 95.3%;
HTK: 88.5%), and the lowest ancestral effective population sizes
estimated by BATWING (results not shown). In addition, the
persistence of stronger genetic structure among HTF and HTK
tribal populations, as seen in AMOVA, PCA and MDS analyses,
suggests limited admixture with other TN populations. The 
absence of any human habitation sites in the Western Ghats until 
the Neolithic, and the late paleo-botanical evidence for cultivation, 
suggest a relatively late occupation of these mountains [34]. It is 
therefore possible that, upon agricultural expansion into previously 
non-cultivated areas, the present day tribal populations were 
displaced to more isolated regions, where they retained their mode 
of subsistence and genetic distinctiveness until the present day. 
The overall Y-chromosomal landscape of TN suggests a 
complex process of agricultural expansion, which can be explained 
in terms of the formation of moving and static frontiers since 
6 Kya, followed by migrations structured by habitat and 
occupation. However, because gene flow and differential assim- 
ilation of incoming migrations could alter the estimated divergence 
dates, they should be treated with caution. Our BATWING 
simulations and others from a previous study [62] have shown that 
topologies and population splits for modal trees are susceptible to 
admixture between already differentiated populations, which 
considerably reduces the times of split, but insensitive to migration 
into a region bringing new paternal lineages. This means that the
divergence time estimates presented here likely reflect the latest 
major admixture that occurred among the populations being 
sampled from the TN region. In this regard, it is important to note 
that our BATWING estimates are concordant with historical 
records of major splits between two Vanniyar and between two 
Nadar populations, thereby supporting the ability of BATWING 
to detect recent demographic events. Thus, the main limitation of
BATWING is related to its lack of power to detect earlier 
demographic events and its bias in clearly detecting recent gene 
flow among the populations studied. In any case, our conclusions 
supporting a common autochthonous Indian genetic heritage from 
the late Pleistocene/early Holocene for both tribal and non-tribal 
populations and refuting the hypothesis of the establishment of a 
structured and endogamous system due to an Indo-Aryan 
migration or implementation of the Varna System, still hold even 
if the BATWING divergence times are underestimates. 
Although previous genetic studies have already drawn some of 
the conclusions presented here [6,7,16,23], this is the first time 
(that we are aware of) that a genetic study has shown clear 
evidence of the existence of long-standing endogamous popula- 
tion identities within a highly structured Indian society established 
prior to the regional implementation of the Varna system. Further, 
these paternal genetic identities likely resulted as a byproduct of 
demographic processes that occurred during the creation of 
moving and static frontiers of agricultural expansions into TN 
[30,69]. The meticulous sampling strategy focused on a local area, 
and comparison of genetic data with the paleoclimatic, arche- 
ological, and historical background information available for the 
region, allowed us to address these questions at a deeper level than 
previous studies have. Moreover, this approach reduced consid- 
erably the confounding relationships among socio-cultural factors 
allowing us to further explore and test in detail the relationships 
between ethnography and genetics. Indeed, the pattern of long- 
term separation among populations within and between MPGs, 
and the genetic affinities of the constituent populations within 
MPGs, are significant features that would be lost if populations 
were pooled by other proxies based on broad classifications such as 
tribal versus non-tribal categories or Varna rank-caste hierarchy. 
We were also able to show that not all of the tribal populations 
reflect the oldest genetic legacy of the region and that each tribal 
population has a unique and distinct evolutionary history. 

Thus, the sampling and analytical approach employed here 
suggest that detailed local genetic studies within India could give 
us new insights about the relative influences of past demographic 
events in relation to other socio-cultural and economic factors that 
might have influenced the population structure of the whole of 
India that is observed today. Nevertheless, it cannot be assumed 
that the same demographic processes or socio-cultural factors 
affected Indian populations from different regions in a similar 
manner. Whether corresponding Y chromosome genetic patterns 
can be also detected in other tribal and non-tribal populations 
within the South Deccan or in other Indian regions that have 
already been identified as centers of agricultural expansions, are 
open questions that future studies could potentially address using 
the methods presented here. Finally, it would also be important to 
investigate the relative impact of the processes explained here on 
the diversity patterns in other genomic regions by studying 
mtDNA and autosomal variation. 


Supporting Information 


Figure S1 PCA plot showing the first two principal 
components of haplogroup frequencies for 97 non-tribal 
(circles) and tribal (squares) populations of India and 
nearby regions from previous publications, compared to 
the non-tribal (horizontal ovals) and tribal (diamonds) 
populations from the present study. Symbols have been 
colored according to linguistic classification. Population codes and 
references are shown in Table S3. 

(TIF) 

Figure S2 Reduced median network of 17 microsatellite 
haplotypes within haplogroup. (a) HG C-M130 using 74 
chromosomes, (b) HG H1-M52 using 292 chromosomes, (c) HG 
H-M69 using 79 chromosomes, (d) HG L1-M27/M76 using 235 
chromosomes, (e) HG R1a1-M17 using 214 chromosomes. Circles 
are colored based on the 7 Major Population Groups as shown in 
Figure 1, and the area is proportional to the frequency of the 
sampled haplotypes. Branch lengths between circles are propor- 
tional to the number of mutations separating haplotypes. 


(TIFF) 


Figure S3 Modal tree obtained by BATWING indicating 
the coalescence time divergence estimates (in years) 
among Major Populations Groups (MPG) using 17 STRs 
from haplogroup (a) F-M89, (b) H1-M52, (c) L1-M26/ 
M72. 

(TIFF) 

Figure S4 Modal tree obtained by BATWING indicating 
the coalescence time divergence estimates (in years) 
among endogamous populations within (a) HTF and 
HTK groups, (b) DLF, (c) BRH and HTC, using 17 STRs 
from all haplogroups. 

(TIFF) 

Table S1 List of Y chromosome SNPs and haplotype 
data for the 1680 individuals from 31 tribal and non- 
tribal populations presented in this study. 


(XLS) 


November 2012 | Volume 7 | Issue 11 | e50269 


Table S2 AMOVA analysis of various population group- 
ings based on the 17-STR haplotype and 95% CI based on 
re-sampling of the samples across populations. 


(XLS) 
Table S3 List of population codes and their publication 
references used in Figure S1. 


(XLS) 
Table S4 Fisher's exact test p-values for the NRY HG 
frequencies among the 7 Major Populations Groups and 
among the 31 sampled populations. 


(XLS) 
INTRODUCTION 


“Genomics,” the study of the genome of a species, is 
the buzz word of the twenty-first century, thanks to the 
Human Genome project that ushered in this new era of 
biology (Baltimore, 2001). The genomic tools are simple, 
straightforward and more accurate, thus making biol- 
ogy a more exact science, similar to physics and chem- 
istry. Nonetheless, the purpose of the studies and the 
design of the experiments become more critical in such 
a venture due to the enormous diversity and complexity 
of the biological phenomena that operate in evolutionary 
processes. The Indian subcontinent, the second success- 
ful home of humankind, is special in this evolutionary 
process due to her population’s long history, many migra- 
tions, isolation, divergence, and cultural evolution since 
the first emigration of man (Wells et al., 2001). The impact 
of natural selection that has operated on these disparate 
gene pools in an alien environment is a matter of intense 
scrutiny, since no parallel for such longstanding and 
sympatrically isolated populations exists in other parts of the 
world, apart from the birthplace of mankind in Africa. 
Many of these isolations seem to have occurred prior to 
language developments. The geographical, subsistence, and 
cultural isolations presumably led to different language 
developments in various parts of India. We attempt here 
to interpret the Non Recombinant Y (NRY) chromosome 
polymorphisms of India in the context of migrations and 
origin of languages. 

In 1901, Karl Landsteiner, the discoverer of the human 
ABO blood group and Nobel laureate, first provided direct 
evidence for the existence of genomic diversity in human 
populations. In 1919, Hirszfeld and Hirszfeld found 
ABO gene variations among human populations. The 
B blood group was unique and most prevalent in South 
Asia, particularly in southern Indian tribal populations 
(Cavalli-Sforza et al., 1994). During the 1950s and 1960s, 
more systematic analysis of variation in genes and proteins 
became possible with the detection by Pauling et al. 
(1949) of blood protein polymorphisms in hemoglobin. 
The 1980s were a transitional period from the analysis of 
gene polymorphisms to protein polymorphism (Sanghvi 
et al., 1981), to the studies of DNA sequence polymor- 
phisms in the form of Human Genome and other vari- 
ome projects (Baltimore, 2001). To better understand the 
origin of this genomic diversity, one may need to study 
population-level forces such as migration and miscegena- 
tion, which play major roles in creating diversity. It has 
been proposed that genomic differentiation in popula- 
tions is mostly due to “fission” followed by independent 
evolution (Cavalli-Sforza, 1997). Mutations, natural selec- 
tion, and drift play important roles in deciphering diver- 
sity at the population level. While mutations supply raw 
material for genomic diversity by introducing new alleles, 
their survival and expansion is dependent on their fitness 
and functional importance. The study of polymorphism 
at the single nucleotide (SNP) level in introns (noncoding 
region) or exons (coding region), or at the microsatellite 
level, becomes a powerful tool in studying genomic diver- 
sity in both health and disease. Recent literature on NRY 
chromosomes makes them ideal candidates to study pop- 
ulation diversity. NRY is evolutionarily a neutral marker, 
thus permitting us to reconstruct a population’s history. 
The distribution of NRY variations of various linguistic 
states of India becomes more interesting. The analyses 
throw better light on the population migrations, language 
development, and its spread. 


RECENT AFRICAN ORIGINS 


The recent African origin and spread of anatomically 
modern humans suggested that Homo sapiens sapiens, 
our species, evolved from a small African population 
that had subsequently colonized the whole world, supplanting 
former hominids, ~120-200 thousand years 
ago (kya) around the time of the first appearance of 
anatomically modern humans (Cann et al., 1987). This 
replacement model, now widely accepted, has been 
later called the “Out of Africa,” or “Recent African 
Origin” (RAO) model, in contrast to the earlier Multi- 
Regional Evolution Model (MRE; see Wolpoff et al., 
1984). Molecular evidence favors the RAO model. Older 
populations, having evolved for longer, must have had more time 
to accumulate genomic diversity. The excess African 
diversity can thus be explained by older onset of popula- 
tion demographic expansion in Africa, combined with 
higher effective population size, population size fluctua- 
tions, and also periodic extinctions of populations out- 
side Africa or positive selection through adaptation to 
new environments outside Africa (Eller, 2001; Aquadro 
et al., 2001). The non-African patterns of genetic varia- 
tion are indeed a subset of African ones. Microsatellite 
studies also showed a gradual reduction of diversity with 
increasing distance from Africa, and linkage disequilib- 
rium values, which reflect the lower ages of haplotypes 
in non-African populations (Tishkoff et al., 1996). The 
Indian subcontinent, the second to be occupied by man, 
thus attracts our attention to investigate further in these 
directions. The RAO model proposes one, two, or multi- 
ple migrations using various routes over a period of time. 
Two routes have been proposed: the first is the “north- 
ern route” over Sinai, leading to eastern Asia through 
the steppes of central Asia and southern Siberia, and the 
second is the “southern route” over southern Arabia, fol- 
lowed by migration along the coastline of India. While 
the northern route model could explain the peopling of 
the whole of Eurasia by a single migration from Africa, 
the southern route model is interpreted as implying at 
least two separate late Pleistocene dispersal events, one 
leading to the northwest and the other to the east of 
Eurasia (Cavalli-Sforza et al., 1994). 


INDIAN CORRIDOR 


Being positioned at the tri-junction of African, northern 
Eurasian and oriental realms, India has served as a major 
corridor for the dispersal of modern humans (Cann, 2001) 
and attracted many streams of people since the Paleolithic, 
starting with the Late Pleistocene as supported by archae- 
ological evidence (Paddaya, 1982; Misra, 2001; Petraglia 
et al., 2010). Though the modern anthropology tends to 
reject the somatoscopic and anthropological measure- 
ments, there is a revival of interest in deciphering skin 
color genes and studying their genome with modern tools 
(Yuldasheva et al., 2002). Sanghvi and Karve, distinguish- 
ing various castes of Tamil Nadu, India, have deciphered 
that nose shape and skin color are the most discrimina- 
tive (Sanghvi et al., 1981). Even today, Indian physical 
anthropologists consider these ancient classifications, 
identifying four different morphological groups in India 
(Bhasin, 2006). These are: (i) “negritos,” characterized by 
dwarf stature and frizzy hair, who are common in Nilgiri 
hills of Tamil Nadu (Paniya, Irula, and Kadar tribes) and 
the Andaman islands: we see them nowadays in many caste 
populations, including Brahmins; (ii) “proto-Australoids,” 
characterized by long head, dark skin, and broad nose, 
found in central and southern India and speaking 
Dravidian languages/dialects; (iii) “Mongoloids,” char- 
acterized by broad face, medium stature, yellow skin, 
and slightly obliquely set eyes, exclusively found in sub- 
Himalayan and northeastern regions, speaking Austro- 
Asiatic (AA) or Tibeto-Burman (TB) languages; and (iv) 
“Dinaric” type (Mediterranean element) with medium to 
light pigment, hook nose, acrocephalic and round heads, 
found in Bengal and Orissa. The “Caucasoids” or the 
“Nordic”, with blond hair and long heads and speaking 
Indo-European (IE) languages is most common in the 
north and northwestern regions of India. The four major 
language families of India seem to have their own non- 
overlapping geographic clines. It will be interesting to 
compare the distribution of the NRY markers and the ori- 
gin of these languages, and answer whether these could 
have arisen through fission and a long process of isolation 
in various regions of India. 


NRY PHYLOGENY IN INDIA 
AFRICAN ROOT 


The roots of the Y phylogeny in Africa have been dated 
around 100 kya (Underhill, 2003), characterized by HG 
A-M91 and HG B-M60 NRY-SNP haplogroups (HGs), 
and restricted to Africa. These migrations and subse- 
quent mutations formed the scaffold on which all other 
Y-chromosome diversification with geographical cline 
has occurred. The majority of Y lineages across the globe 
are composed of a tripartite assemblage consisting of 
(1) HG C-M130, (2) HG D-M174 and HG E-M96, and 
(3) overarching HG F-M89, which defines the internal 
node of all remaining HGs, G-M201 through R-M207 
(Underhill et al., 2001; Wells, 2007). 


OUT OF AFRICA EMIGRATIONS 


The HG C-M130, not seen in any African populations, 
presumably originated somewhere in Asia on an M168 
lineage sometime after an early departure event (Capelli 
et al., 2001; Underhill et al., 2001; Table 74-1). This clade 
(C-M130) characterizes the first migrants into India: the 
descendants we could identify near Madurai (Wells et al., 
2001). This clade has many sublineages displaying irregu- 
lar geographic patterning consistent with diversification 
and northward migration of this HG C-M130, since the 
last ice age, with the westernmost limit in India: thus HG 
C3-M217, a transversion mutation, is common in East 
Asia and Siberia, with representatives in North America 
(Karafet et al., 2001; Lell et al., 2002), and eastern and 
central parts of central Asia, while C5-M356 is common 
in India (Sengupta et al., 2006). This lineage is absent in 
Indonesia, Oceania (Kayser et al., 2000), and Yunnan, 
China (Karafet et al., 2001). 

The ancestors who accumulated HG D-M174 and HG 
E-M96 mutations could have arisen in Africa or Asia 
(Underhill, 2003). HG E-M96 lineages are the most fre- 
quent in Africa, and display subsequent binary and mic- 
rosatellite diversification. Conversely, Asian haplogroup 
D-M174 occurs at low frequencies throughout eastern 
Asia, except in remote and isolated locations like Tibet, 
Japan, and the Andaman islands (Underhill et al., 2001; 
Thangaraj et al., 2003). 

The third major and most successful subclade of M168 
lineages, characterized by super-haplogroup F-M89, 
defines the root from which all others (HGs G-M201 
through R-M207) originated and have evolved outside 
Africa (Kivisild et al., 2003). HG F-M89 diversified into 
many branches with region-specific markers—the Middle 
East showing HGs G-M201 and J-M304, Europe with HG 
I-M170, and India with F-M89 and H-M69 lineages, sel- 
dom observed elsewhere (Table 74-1). 


EXPANSION IN INDIA 


HG F*-M89* is the most paraphyletic subcluster (unclas- 
sified derivative) of M168 lineages, ubiquitous but found 
with lesser frequency in various parts of India. Many 
tribal populations of southern India possess higher fre- 
quencies of F*-M89* with high STR variance, particularly 
Dravidian speaking groups of Tamil Nadu and Koya of 
Orissa (Kavitha, 2008; Wells et al., 2001; Kivisild et al., 
2003; Cordaux et al., 2004a; Table 74-2). The high STR 
variance of HG F*-M89* from Tamil Nadu and Andhra 
Pradesh has suggested a deep time depth of 45,000 YBP 
(Sengupta et al., 2006). 
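
The reasoning that links high STR variance to a deep time depth can be sketched in a few lines. This is only a minimal illustration, assuming a simple stepwise mutation model in which the variance in repeat number within a haplogroup grows roughly linearly with time; it is not the dating procedure used in the cited studies. The function name, the example haplotypes, the mutation rate, and the generation length are all hypothetical.

```python
from statistics import mean, pvariance


def str_variance_age(haplotypes, mutation_rate, years_per_generation):
    """Rough age (in years) of a haplogroup from its STR repeat-count variance.

    haplotypes: list of equal-length tuples of repeat counts, one tuple per
                sampled chromosome, one position per STR locus.
    mutation_rate: assumed mutations per locus per generation.
    years_per_generation: assumed generation length in years.
    """
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes to compute a variance")
    # Regroup the data so that each tuple holds the repeat counts of one locus.
    per_locus = list(zip(*haplotypes))
    # Average the population variance of repeat counts over all loci.
    mean_variance = mean(pvariance(locus) for locus in per_locus)
    # Stepwise-model assumption: variance is approximately rate x generations.
    generations = mean_variance / mutation_rate
    return generations * years_per_generation


# Hypothetical two-locus haplotypes, for illustration only.
example = [(14, 12), (15, 12), (14, 13), (15, 13)]
print(round(str_variance_age(example, mutation_rate=0.002, years_per_generation=25)))
```

Because the estimate scales inversely with the assumed mutation rate, the same variance can yield very different ages; this is one reason the ages in Table 74-1 need to be read with caution.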


TABLE 74-1 THE AGES OF THE NRY HAPLOGROUPS AND THEIR DISCRIMINATING ALLELES PREVALENT IN INDIA AND NEARBY REGIONS. 

NRY HG | Marker | Estimated age of the mutation (YBP)* | Distribution | Reference 
C | M130 | 50,000 | India, Australia, Central Asia, America | Genographic# 
F | M89 | 45,000 | India | Genographic# 
G | M201 | 30,000 | | Genographic# 
H | M69 | 20,000-30,000 | | Genographic# 
H1 | M52 | 25,000 | India | Genographic# 
J | M304 | 31,700 | Middle East | Semino et al. (2004) 
J2 | M172 | 15,000-20,000 | | Hammer et al. (2000) 
K | M9 | 40,000 | | Genographic# 
L | M20 | 30,000 | India | Genographic# 
L1 | M27/M76 | 9100 | | Sengupta et al. (2006) 
O | M175 | 35,000 | Orissa, North East, South East Asia | Genographic# 
O2a | M95 | 11,700 | Orissa | Sengupta et al. (2006) 
O3 | M122 | 10,000 | North East, South East Asia, China | Genographic# 
P | M45 | 40,000 | North Asia | Wells et al. (2001) 
Q | M242 | 15,000-18,000 | | Seielstad et al. (2003) 
R | M207 | 30,000 | | Genographic# 
R2 | M124 | 25,000 | India | Genographic# 
R1a1 | M17 | 15,000 | Caucasus, Europe, India | Wells et al. (2001) 

*The estimated ages have all been determined based on the available data: if it was not a representative sampling, then the age may vary. 
Ascertainment bias is possible due to smaller samples and sampling errors; hence, some of these ages may need to be considered with caution. 

#www.nationalgeographic.com/genographic website. 
TABLE 74-2 NRY HG ALLELE DISTRIBUTION DATA USED FOR COMPUTING PAN-ASIAN PCA 

Serial No | Population | Country/Region | Province/State | Language families | Population codes in Pan-Asian PCA, Fig2 | Sample size, C-M130, D-M174, E*-M96, G-M201, I-M170, F*-M89 
1 | Pathan | Pakistan | | IE | Pak7 | 21 4.762 0 0 9.524 0 4.762 
2 | Sindhi | Pakistan | | IE | Pak8 | 0 0 0 4.762 
3 | Hazara | Pakistan | | IE | Pak4 | 0 0 4 0 
4 | Kalash | Pakistan | | IE | Pak5 | 0 20 0 0 
5 | Makrani | Pakistan | | IE | Pak6 | 5 0 0 0 
6 | Konka Brahmin | India West | Goa | IE | Goa1 | 0 0 0 4.651 
7 | Gujarat | India West | Gujarat | IE | Guj3 | 0 0 0 3.448 
8 | Gujarat Brahmin | India West | Gujarat | IE | Guj1 | 3.125 10.94 0 0 
9 | Bhils | India West | Gujarat | IE | Guj2 | 22 9.091 0 0 0 0 18.18 
10 | Desasath Brahmin | India West | Maharashtra | IE | Mah2 | 16 6.25 0 0 0 0 0 
11 | Kathari | India West | Maharashtra | IE | Mah3 | 19 0 0 0 0 0 26.32 
12 | Maratha | India West | Maharashtra | IE | Mah5 | 36 5.556 0 0 0 0 5.556 
13 | Punjab | India West | Punjab | IE | Pun2 | 66 3.03 0 0 0 0 4.545 
14 | Punjab Brahmin | India North | Punjab | IE | Pun1 | 49 4.082 0 0 4.082 0 4.082 
15 | Kashmir Gujars | India North | Jammu Kashmir | IE | J&K1 | 49 2.041 0 0 0 0 4.082 
16 | Kashmiri Pandits | India North | Jammu Kashmir | IE | J&K2 | 51 1.961 0 0 1.961 0 3.922 
17 | Rajput | India North | Rajasthan | IE | Raj1 | 29 3.448 0 0 0 0 10.34 
18 | Uttar Pradesh Brahmin | India North | Uttar Pradesh | IE | Uprd | 31 0 0 0 0 0 0 
19 | Bihar Brahmins | India Central | Bihar | IE | Bih1 | 56 1.786 0 0 0 0 0 
20 | Madhya Pradesh Brahmins | India Central | Madhya Pradesh | IE | MP1 | 42 0 0 0 0 0 2.381 
21 | Halba | India Central | Maharashtra | IE | Mah4 | 21 0 0 0 0 0 23.81 
22 | Karan | India Central | Orissa | IE | Ori1 | 18 0 0 0 0 0 0 
23 | Oriya Brahmin | India Central | Orissa | IE | Ori2 | 24 0 0 0 0 0 4.167 
TABLE 74-2 (CONTINUED) 

Serial No | H1*-M52, J2-M172, L-M20/M11, K-M9, N-M231, O*-M175, O2a-M95, O3-M122, P-M74/M45, Q*-M242/P36, R*-M207, R1a1-M17, R2-M124 | Reference 
1 | 9.524 0 9.524 0 0 0 0 0 0 9.524 14.29 38.1 | Sengupta et al., 2006 
2 | 0 28.57 4.762 0 0 0 0 0 0 4.762 0 | Sengupta et al., 2006 
3 | 0 4 0 0 0 0 0 8 0 8 32 0 | Sengupta et al., 2006 
4 | 20 10 25 0 0 0 0 0 0 0 5 | Sengupta et al., 2006 
5 | 0 25 20 0 0 0 0 0 0 5 10 | Sengupta et al., 2006 
6 | 6.977 13.95 18.6 2.326 0 0 0 0 0 0 0 9.302 | Kivisild et al., 2003 
7 | 10.34 20.69 10.34 3.448 0 0 0 0 6.897 0 0 3.448 | Kivisild et al., 2003 
8 | 1.563 15.63 7.813 3.125 3.125 0 0 0 0 0 9.375 9.375 | Sharma et al., 2009 
9 | 9.091 18.18 18.18 0 0 0 0 0 0 0 0 18.18 | Sharma et al., 2009 
10 | 18.75 12.5 12.5 0 0 0 0 0 0 0 0 | Sahoo et al., 2006 
11 | 36.84 5.263 5.263 0 0 0 0 0 5.263 0 5.263 15.79 0 | Sahoo et al., 2006 
12 | 30.56 19.44 11.11 0 0 0 0 0 0 0 0 13.89 13.89 | Sengupta et al., 2006; Sahoo et al., 2006 
13 | 3.03 21.21 12.12 0 0 0 0 0 7.576 0 0 46.97 4.545 | Kivisild et al., 2003 
14 | 0 22.45 6.122 0 0 0 0 0 0 0 0 34.69 24.49 | Sharma et al., 2009 
15 | 10.2 6.122 16.33 8.163 0 0 0 0 0 2.041 2.041 40.82 8.163 | Sharma et al., 2009 
16 | 9.804 9.804 5.882 9.804 0 0 0 0 0 5.882 17.65 19.61 13.73 | Sharma et al., 2009 
17 | 17.24 13.79 6.897 0 0 0 0 3.448 0 0 0 31.03 13.79 | Sengupta et al., 2006 
18 | 16.13 3.226 3.226 0 0 0 0 0 0 6.452 0 67.74 3.226 | Sharma et al., 2009 
19 | 0 8.929 8.929 3.571 0 0 0 0 0 3.571 3.571 64.29 5.357 | Sahoo et al., 2006; Sharma et al., 2009 
20 | 7.143 23.81 7.143 0 2.381 0 0 0 2.381 4.762 0 38.1 11.9 | Sharma et al., 2009 
21 | 23.81 0 0 0 0 0 28.57 0 0 4.762 0 19.05 0 | Sengupta et al., 2006 
22 | 16.67 5.556 0 0 0 0 0 0 0 0 0 55.56 22.22 | Sahoo et al., 2006 
23 | 8.333 4.167 20.83 0 0 0 0 0 4.167 0 4.167 41.67 12.5 | Sahoo et al., 2006 
(Continued) 
TABLE 74-2 (CONTINUED) 

Serial No | Population | Country/Region | Province/State | Language families | Population codes in Pan-Asian PCA, Fig2 | Sample size, C-M130, D-M174, E*-M96, G-M201, I-M170, F*-M89 
24 | Lambadi | India South | Andhra Pradesh | IE | And9 | 53 11.32 0 0 0 0 3.774 
25 | W.Bengal | India East | West Bengal | IE | WB1 | 31 3.226 0 0 3.226 0 6.452 
26 | Karmali | India East | West Bengal | IE | WB4 | 16 0 0 0 0 0 0 
27 | Kora | India East | West Bengal | IE | WB5 | 17 0 0 0 0 0 17.65 
28 | WB. Brahmin | India East | West Bengal | IE | WB7 | 49 0 0 0 0 0 0 
29 | Garo | India East | Meghalaya | TB | NE2 | 33 0 0 0 0 0 18.18 
30 | Jamatia | India East | West Bengal | TB | NE3 | 30 0 0 0 0 0 0 
31 | Korku | India West | Maharashtra | AA | Mah1 | 59 0 0 0 0 0 15.25 
32 | Asur | India Central | Jharkand | AA | Jha | 55 0 0 0 0 0 25.45 
33 | Birjia | India Central | Jharkand | AA | Jha2 | 24 0 0 0 0 0 0 
34 | Korwa | India Central | Jharkand | AA | Jha3 | 42 0 0 0 0 0 Ss Weis} 
35 | Savar | India Central | Jharkand | AA | Jha4 | 47 0 0 0 0 0 40.43 
36 | Kharia | India Central | Jharkand | AA | Jha5 | 46 2.174 0 0 0 0 39.13 
37 | Munda | India Central | Jharkand | AA | Jha6 | 60 0 0 0 0 0 23.33 
38 | Juang | India Central | Orissa | AA | Ori3 | 59 0 0 0 0 0 1.695 
39 | Ho | India Central | Orissa | AA | Orid | 116 0 0 0 0 0 22.41 
40 | Mahali | India Central | West Bengal | AA | WB3 | 38 0 0 0 0 0 39.47 
41 | Khasi | India East | Meghalaya | AA | NE1 | 92 0 0 0 0 0 17.39 
42 | Mudi | India East | West Bengal | AA | WB2 | 37 0 0 0 0 0 45.95 
43 | Lodha | India East | West Bengal | AA | WB6 | 71 1.408 0 0 0 0 14.08 
TABLE 74-2 (CONTINUED) 

Serial No | H1*-M52, J2-M172, L-M20/M11, K-M9, N-M231, O*-M175, O2a-M95, O3-M122, P-M74/M45, Q*-M242/P36, R*-M207, R1a1-M17, R2-M124 | Reference 
24 | 5.66 3.774 11.32 3.774 0 0 0 0 33.96 0 11.32 13.21 1.887 | Sahoo et al., 2006; Kivisild et al., 2003 
25 | 9.677 6.452 0 0 0 3.226 0 0 6.452 0 0 38.71 22.58 | Kivisild et al., 2003 
26 | 0 0 0 0 0 0 0 0 0 100 | Sahoo et al., 2006 
27 | 0 0 0 29.41 0 0 0 0 5.882 0 | Sahoo et al., 2006 
28 | 0 0 0 0 0 0 0 0 71.43 22.45 | Sharma et al., 2009; Sengupta et al., 2006 
29 | 0 0 3.03 18.18 54.55 0 0 6.061 | Kumar et al., 2007 
30 | 0 0 0 6.667 76.67 0 0 0 | Sengupta et al., 2006 
31 | 0 0 0 81.36 1.695 1.695 0 0 | Kumar et al., 2007 
32 | 0 0 0 63.64 0 0 0 1.818 | Kumar et al., 2007 
33 | 0 0 0 95.83 0 4.167 0 0 | Kumar et al., 2007 
34 | 0 0 0 59.52 0 0 4.762 | Kumar et al., 2007 
35 | 31.91 0 0 0 12.77 0 0 0 0 14.89 | Kumar et al., 2007 
36 | 6.522 0 0 2.174 2.174 0 0 0 2.174 45.65 | Kumar et al., 2007; Sahoo et al., 2006 
37 | 11.67 0 6.667 0 0 0 1.667 0 0 50 | Kumar et al., 2007; Sahoo et al., 2006 
38 | 0 0 0 0 0 0 98.31 0 | Sahoo et al., 2006; Kumar et al., 2007 
39 | 0 0.862 0 0 0 0 71.55 0 2.586 0 2.586 | Sengupta et al., 2006; Kumar et al., 2007; Sahoo et al., 2006 
40 | 13.16 5.263 0 5.263 0 0 7.895 0 0 0 13.16 | Kumar et al., 2007; Sahoo et al., 2006 
41 | 0 0 0 0 0 0 2.174 41.3 29.35 4.348 0 | Kumar et al., 2007 
42 | 0 2.703 0 0 0 0 43.24 0 2.703 0 2.703 | Kumar et al., 2007 
43 | 5.634 30.99 0 2.817 0 0 5.634 0 0 0 0 1.408 38.03 | Sengupta et al., 2006; Kumar et al., 2007 
(Continued) 
TABLE 74-2 (CONTINUED) 

Serial No | Population | Country/Region | Province/State | Language families | Population codes in Pan-Asian PCA, Fig2 | Sample size, C-M130, D-M174, E*-M96, G-M201, I-M170, F*-M89 
44 | Oraon | India Central | Jharkand | DR | Jha7 | 100 0 0 0 0 0 54 
45 | Muria | India Central | Orissa | DR | Ori4 | 20 0 0 0 0 0 10 
46 | Koraga | India South | Andhra Pradesh | DR | And1 | 33 0 6.061 0 0 0 0 
47 | Koya | India South | Andhra Pradesh | DR | And2 | 41 0 0 0 0 0 36.59 
48 | Yerava | India South | Andhra Pradesh | DR | And3 | x 26.83 0 0 0 0 43.9 
49 | Kappu naidu | India South | Andhra Pradesh | DR | And4 | 18 0 0 0 5.556 0 0 
50 | Komati | India South | Andhra Pradesh | DR | And5 | 20 0 0 0 5 0 10 
51 | Naikpod Gond | India South | Andhra Pradesh | DR | And6 | 18 22.22 0 0 0 0 11.11 
52 | Raju | India South | Andhra Pradesh | DR | And7 | 19 0 0 0 0 0 0 
53 | Yerkual | India South | Andhra Pradesh | DR | And8 | 18 0 0 0 0 0 0 
54 | Konda Reddy | India South | Andhra Pradesh | DR | And10 | 30 0 0 0 0 0 23.33 
55 | Koya Dora | India South | Andhra Pradesh | DR | And11 | 27 0 0 0 0 0 25.93 
56 | Andh | India South | Andhra Pradesh | DR | And12 | 54 1.852 0 0 0 0 3.704 
57 | Iyer | India South | Tamil Nadu | DR | TN1 | 29 6.897 0 0 10.34 0 3.448 
58 | Kurumba | India South | Tamil Nadu | DR | TN2 | 19 0 0 0 0 0 15.79 
59 | Iyengar | India South | Tamil Nadu | DR | TN3 | 47 0 0 0 8.511 0 0 
60 | Irula | India South | Tamil Nadu | DR | TN4 | 40 5 0 0 0 0 42.5 
61 | Pallan | India South | Tamil Nadu | DR | TN5 | 44 2.273 0 0 0 0 6.818 
62 | Kallar | India South | Tamil Nadu | DR | TN6 | 93 6.452 0 0 0 0 16.13 
63 | Sinhalese | Sri Lanka | Sri Lanka | DR | SL1 | 39 0 0 0 0 0 12.82 
64 | Burushaki | Pakistan | | unclassified | (a | 20 5 0 0 5 0 0 

Abbreviations of language families: IE=Indo European, AF=Afro-Asiatic, DR=Dravidian, TB=Tibeto-Burman, ST=Sino-Tibetan, AA=Austro-Asiatic, AT=Altaic, BR=Brusaki. 
Serial No | H1*-M52, J2-M172, L-M20/M11, K-M9, N-M231, O*-M175, O2a-M95, O3-M122, P-M74/M45, Q*-M242/P36, R*-M207, R1a1-M17, R2-M124 | Reference 
44 | 3 0 0 0 0 0 35 0 4 0 4 0 3 | Kumar et al., 2007; Sahoo et al., 2006 
45 | 80 0 0 0 0 0 10 0 0 0 0 0 | Sengupta et al., 2006 
46 | 87.88 0 0 0 0 0 0 0 0 0 0 0 6.061 | Cordaux et al., 2004a 
47 | 60.98 0 0 0 0 0 0 0 0 0 0 2.439 0 | Kivisild et al., 200 
48 | 19.51 0 0 0 0 0 0 0 0 0 0 9.756 0 | Cordaux et al., 2004a 
49 | 0 0 0 0 0 0 0 0 0 0 44 4 Ae ala 72.22 | Sahoo et al., 2006 
50 | 0 0 0 0 0 0 0 0 0 0 0 15 70 | Sahoo et al., 2006 
51 | 61.11 0 5.556 0 0 0 0 0 0 0 0 0 0 | Sahoo et al., 2006 
52 | 0 10.53 21.05 15.79 0 0 0 0 0 0 15.79 26.32 10.53 | Sahoo et al., 2006 
53 | 0 0 11.11 55.56 0 0 0 0 0 0 0 33.33 0 | Sahoo et al., 2006 
54 | 3.333 0 0 0 0 0 66.67 0 0 0 0 6.667 0 | Sengupta et al., 2006 
55 | 22.22 3.704 0 0 0 0 48.15 0 0 0 0 0 0 | Sengupta et al., 2006 
56 | 16.67 35.19 1.852 0 0 0 1.852 1.852 0 0 0 31.48 5.556 | Thanseem et al., 2006 
57 | 3.448 17.24 17.24 0 0 0 0 0 0 0 3.448 27.59 10.34 | Sengupta et al., 2006 
58 | 68.42 0 5.263 0 0 0 0 0 0 0 0 0 10.53 | Sengupta et al., 2006 
59 | 23.4 19.15 19.15 0 0 0 0 0 0 0 0 23.4 6.383 | Sengupta et al., 2006; Sahoo et al., 2006 
60 | SD 2.5 7.5 0 0 0 0 0 0 0 0 0 1D | Sengupta et al., 2006; Sahoo et al., 2006 
61 | 29.55 9.091 11.36 9.091 0 0 0 0 0 0 2.273 15.91 13.64 | Sengupta et al., 2006; Sahoo et al., 2006 
62 | 18.28 1.075 44.09 0 0 0 1.075 0 1.075 0 0 3.226 8.602 | Wells et al., 2001; Sahoo et al., 2006 
63 | 7.692 10.26 17.95 0 0 0 0 0 0 0 0 12.82 38.46 | Kivisild et al., 2003 
64 | 15 5 15 5 0 0 0 5 0 0 30 0 15 | Sengupta et al., 2006 
NRY HG H1-M52, a derivative of M69, has been 
reported in higher frequencies in southern India, cut- 
ting across the caste and tribal boundaries (Wells 
et al., 2001; Kivisild et al., 2003). A few other studies 
have found HG H1-M52 with high STR variance in 
Maharashtra (Sengupta et al., 2006) and western India 
(Trivedi et al., 2008). Thanseem et al. (2006) have sug- 
gested that M52 originated in the Indian subcontinent 
immediately after the Late Pleistocene settlements. The 
available samples in literature show an estimated age of 
25,000 years (Table 74-1). 


NEOLITHIC CATTLE KEEPERS 


The J-M172 clade implicated in agricultural expansion 
through Neolithic cattle keepers is thought to have arisen 
in the Caucasus and Anatolia and spread to southwest- 
ern Europe (Cavalli-Sforza et al., 1994; Semino et al., 
2004; Hammer et al., 1998; Rosser et al., 2000); Bedouin 
and Palestinian Arabs possess the highest frequency of 
this mutation (66%-55%) followed by Sephardic Jews 
and Muslim Kurds (40%; see Semino et al., 2004). The 
J HG is divided into two sub-haplogroups, J1-M267 and 
J2-M172, with the former showing an ancestral Y-STR 
haplotype 14-16-23-11-12, for loci DYS19-DYS388- 
DYS390-DYS392-DYS393 and the latter 14-15-23-11-12 
(Giacomo et al., 2004). A one-step mutated haplotype 
of this J2, viz. 15-15-23-11-12, is the common clade in 
India: the Thodas of Nilgiris possess this J2 in higher fre- 
quencies, correlating with their pastoral buffalo cult life 
(Kavitha, 2008). It has been proposed that earlier migra- 
tions brought agriculture and Dravidian speakers into 
India, while another, much later one brought rice cultiva- 
tors from Southeast Asia (Diamond and Bellwood, 2003; 
Fuller, 2003). HG J2-M172 has shown a high STR diversity 
in Dravidian tribal populations, but to hypothesize that 
this HG J2-M172 is a part of a Neolithic expansion would 
require more evidence (Thanseem et al., 2006). The avail- 
able datasets thus do not correlate well with these two 
major events. The archaeology once again does not sup- 
port this contention. 
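
The "one-step" relationship between the J2 haplotypes quoted above can be checked mechanically. The sketch below is a minimal illustration that counts repeat-unit differences between five-locus Y-STR haplotypes written in the order DYS19-DYS388-DYS390-DYS392-DYS393; the function names are hypothetical, and treating each repeat-unit difference as one mutational step is a simplifying assumption.

```python
# Locus order used for the haplotypes quoted in the text.
LOCI = ("DYS19", "DYS388", "DYS390", "DYS392", "DYS393")


def parse_haplotype(text):
    """Turn a string such as '14-15-23-11-12' into a tuple of repeat counts."""
    repeats = tuple(int(part) for part in text.split("-"))
    if len(repeats) != len(LOCI):
        raise ValueError(f"expected {len(LOCI)} loci, got {len(repeats)}")
    return repeats


def step_distance(a, b):
    """Total number of single-repeat steps separating two haplotypes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def differing_loci(a, b):
    """Names of the loci at which two haplotypes differ."""
    return [locus for locus, x, y in zip(LOCI, a, b) if x != y]


j1_ancestral = parse_haplotype("14-16-23-11-12")
j2_ancestral = parse_haplotype("14-15-23-11-12")
j2_india = parse_haplotype("15-15-23-11-12")

# The common Indian J2 haplotype versus the ancestral J2 haplotype.
print(step_distance(j2_ancestral, j2_india), differing_loci(j2_ancestral, j2_india))
# The ancestral J1 haplotype versus the ancestral J2 haplotype.
print(step_distance(j1_ancestral, j2_ancestral), differing_loci(j1_ancestral, j2_ancestral))
```

With the haplotypes quoted in the text, the Indian J2 haplotype differs from the ancestral J2 haplotype by a single repeat at DYS19, while the ancestral J1 and J2 haplotypes differ by a single repeat at DYS388.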

Some studies have found HG J2-M172 at higher frequencies
in Dravidian and Indo-European castes than
in tribes (Sengupta et al., 2006; Cordaux et al., 2004a).
It is absent in East Asia, and typically present in Central
Asia at frequencies of 10%-20%, leading Cordaux et al.
to infer that Indian HG J2-M172 originated from
Central Asia rather than West Asia. The data listed in
Table 74-2 show the presence of J2 in a wide variety
of populations across India. Neolithic markers of early
farmers (HGs E3-M35 and G-M201, which are prevalent
in Europe, Anatolia, the southern Caucasus, and Iran)
are, however, sporadic in Indians (Semino et al., 2000;
Underhill et al., 2001).


CENTRAL ASIAN EXPANSION


An expansion of HG F-M89 lineages toward Central Asia
or the Caucasus also gave rise to a founder that acquired
the HG K-M9 mutation, defining another major bifurcation
in the phylogeny. Distinctive HG K-M9 sublineages
have been observed in India, the Middle East, and
Europe, while some HG K-M9 and HG M-M186 lineages
are restricted to Oceania. Among three major lineages of
K-M9, HG P-M45 is characteristic of North Asia, HG
Q-M242 is found in Siberia and North America, and the
westward-expanding HG R-M207 is found in Eurasia. K-M9
has also given rise to two offshoots: HG L-M20, prevalent
in the Indian subcontinent and becoming L1-M76 in southern
India, and HG O-M175, found in eastern Asia across the
whole of the Oriental populations including the Chinese, and
also in the Austro-Asiatic and Tibeto-Burmese
speakers of India. The genomic evidence further supports
this (HUGO Pan-Asian SNP Consortium, 2009).


ORIGIN OF L AND DRAVIDIAN SPEAKERS 


The NRY HG L-M20 is virtually absent in Europe, but
found irregularly and at low frequencies in populations
of the Middle East and southern Caucasus (Nebel et al.,
2001). It occurs at a frequency of 4.3% in Pakistan and
13.5% in Central Asia (Qamar et al., 2002; Semino et al.,
2000; Wells et al., 2001). At a resolution of six STR loci
(DYS19-DYS388-DYS390-DYS391-DYS392-DYS393), four
Chenchu tribal individuals from Andhra Pradesh shared
a widespread common haplotype, 14-12-22-10-14-11.
This is shared by Lambadis, Punjabis, and Iranians. An
Armenian haplotype, 15-12-23-10-13-11, commonly found
in their HG L-M20, is three mutational steps away from it
(Weale et al., 2001). These differences indicate two distinct
founders and independent expansions; more data are required
to identify the antiquity of these populations. The hitherto
available L subtyping data shows the presence of HG 
L1-M76 in many northwestern states and Dravidian- 
speaking southern belts of India (Trivedi et al., 2008). 
Sengupta et al. (2006) found a subtype of HG L-M20 to 
be the most common haplogroup in India, and proposed 
its early diversification in Dravidian speakers and subse- 
quent expansion toward peripheral regions, suggesting 
an Indian origin of Dravidian speakers. The Brahmin 
populations from Tamil Nadu have been considered as 
Dravidian speakers in support of their argument. However,
Sahoo et al. (2006) observed an absence of HG L-M20 in IE
speakers from Bihar, Orissa, and West Bengal, and
concluded that the distribution of NRY HGs in India was
associated with geography rather than linguistics. Among
Austro-Asiatic (AA) speakers of India, as mentioned earlier,
HG O2-M95 is predominantly a Southeast Asian
marker (Basu et al., 2003) and virtually absent in Central
Asia (Wells et al., 2001). Thanseem et al. (2006) found
HG O2-M95 at the highest frequency in AA tribes (52%),
with a deeper coalescence age (68,000 YBP) that does not
fit with the history of other NRY clades. Non-AA castes
and tribes have a frequency of this marker of 6.3%, and
the scenario has suggested a footprint of earlier AA settlers
carrying this defining mutation. HG O3-M122 and
its sublineage HG O3e-M134, which spread through East
Asia (Su et al., 2000), showed the highest frequency among
Tibeto-Burman (TB) speakers of North East India, while
the caste groups of the region possess only 3% (Trivedi
et al., 2006). Further, since the coalescence age for HGs
C-M130, H-M69, and R2-M124 was deeper compared to
HG O-M175, they concluded that AA speakers could not
have been the earliest settlers of India. More recent data
in Table 74-1, however, suggest an estimated age of 35,000
years for HG O-M175 and 11,700 years for HG O2a-M95.


R2 RESTRICTED TO INDIA AND ITS NEIGHBORS 


HG R2-M124, the last major clade of significance to appear
in India, is restricted to India, Pakistan, Iran, and southern
Central Asia (Kivisild et al., 2003); however, it has been
seen at its highest frequency (53%) among the Sinte Romani
(Gypsy) (Wells et al., 2001). Cordaux et al. (2004a) have
suggested that HG R2 originated in India; this conclusion
was based on the presence of this clade in both
Dravidian and Indo-European speakers. Within India
it is predominant on the east coast and in southern India
(Sahoo et al., 2006). Network analysis of available data has
shown that a large number of haplotypes were shared
between populations of South India, while the populations
of eastern India harbored more discrete haplotypes,
originating in situ.


THE ENIGMA OF R1A1 


Contrary to R2, the widespread northern Indian clade
among Brahmin-related groups, HG R1a1-M17, has been
linked with the recent spread of Kurgan culture originating
in southern Russia/Ukraine and dispersing to
Europe, Central Asia, and India between 3000 and 1000 BCE
(Passarino et al., 2001; Quintana-Murci et al., 2001; Wells
et al., 2001). In a global analysis, a deeper Palaeolithic
time depth of ~15,000 YBP for the HG R1a1-M17 mutation
has been suggested (Semino et al., 2000; Wells et al., 2001).
Further, two region-specific Y-STR allele patterns have
been associated with HG R1a1-M17 among Europeans
(Passarino et al., 2002): allele 15 at DYS19 and alleles 19
and 21 at locus YCA IIa,b against the background of HG
R1a1-M17 characterize populations of Western Europe,
while allele 16 at DYS19 and alleles 19 and 23 at YCA IIa,b
characterize Eastern European populations.

Interestingly, the high frequency of HG R1a1-M17 is
concentrated around the elevated terrain of central and
western Asia, and it is present at a relatively low frequency
in the Caucasus and Middle East. In Central Asia, its
frequency is highest in the highlands among Tajiks, Kyrgyz,
and Altais (>50%) and drops to <10% in the plains
among the Turkmenians and Kazakhs (Wells et al., 2001;
Zerjal et al., 2002). In contrast to the above, other studies
have observed a high HG frequency in Central Asians
but lower average STR diversity than in Indian castes and
tribes. This has been attributed to a founder effect from
southern and western Asia during the early Holocene
expansion, contributing HG R1a1-M17 chromosomes to
both Central Asian and South Asian tribes prior to the
arrival of the Indo-European speakers (Kivisild et al.,
2003; Thanseem et al., 2006; Trivedi et al., 2008). Zerjal
et al. (2002), however, attributed the low Y-STR diversity
to a bottleneck effect in Central Asian populations. Some
authors also propose an Indian origin for HG R1a1-M17
based on the high frequency and associated STR
variance in India (Sharma et al., 2009), while others attribute
the origin to Central Asia (Wells et al., 2001). While
extensive subclades and subtypes have been identified for NRY
HG R1b, R1a1 has been the least studied (Underhill
et al., 2010) for lack of new markers; we await more data
from our genographic project
(www.nationalgeographic.com/genographic) in order to
further decipher the early population-movements scenario.


LANGUAGE CORRELATES OF NRY DISTRIBUTION: A PRINCIPAL COMPONENT ANALYSIS


The overall NRY distribution based on the hitherto available
data suggests a pattern of peopling of India. While
HG R1a1 is prevalent in the northern Indian belt,
HG O and its derivatives are predominantly seen in the
east-central and northeastern regions of India, mostly among
tribals. HG L is restricted to various Dravidian-speaking
populations of India and some populations of Pakistan
(Figure 74-2). The data in Table 74-2 and Figure 74-1,
showing the stacked areas of various alleles in different
populations, reveal a striking correlation between the NRY
composition of populations and the languages they speak.
This is further brought out by the principal component
analysis (PCA) (Figure 74-2) of the data in Table 74-2. The
first two components of the PCA account for more than 50%
of the total variance (Figure 74-3).
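
How a scree plot such as Figure 74-3 follows from a table of
haplogroup frequencies can be sketched as below. This is a minimal
illustration in Python with NumPy, assuming that the PCA is computed
by singular value decomposition of the column-centred frequency
matrix. The population labels and frequencies are invented for
illustration and are not the values of Table 74-2, and the software
used for the published analysis is not stated here.

```python
import numpy as np

# Hypothetical data: rows are populations, columns are haplogroups.
# These frequencies are invented for illustration only; they are NOT
# taken from Table 74-2. Each row sums to 1.
HAPLOGROUPS = ["R1a1", "O2a", "H1", "L", "F*", "C"]
POPULATIONS = ["IE_1", "IE_2", "DR_1", "DR_2", "AA_1", "AA_2"]
FREQ = np.array([
    [0.55, 0.02, 0.15, 0.08, 0.10, 0.10],
    [0.45, 0.05, 0.20, 0.10, 0.12, 0.08],
    [0.10, 0.03, 0.30, 0.35, 0.15, 0.07],
    [0.15, 0.02, 0.25, 0.30, 0.20, 0.08],
    [0.05, 0.70, 0.10, 0.02, 0.08, 0.05],
    [0.08, 0.60, 0.12, 0.05, 0.10, 0.05],
])


def pca(freq):
    """PCA via SVD of the column-centred frequency matrix."""
    centred = freq - freq.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                       # population coordinates per PC
    explained = s**2 / np.sum(s**2)      # share of total variance per PC
    return scores, vt, explained


scores, loadings, explained = pca(FREQ)

if __name__ == "__main__":
    # The cumulative shares are what a scree plot displays.
    print("variance share per component:", np.round(explained, 3))
    print("first two components:", round(float(explained[:2].sum()), 3))
    for name, (pc1, pc2) in zip(POPULATIONS, scores[:, :2]):
        print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

Plotting the first two columns of `scores` gives a plot of the kind
described for Figure 74-2, in which populations with similar
haplogroup composition fall close together.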

The geographical distribution of the NRY HGs
described in the previous paragraph is clearly brought out
in the PCA plot of the NRY HG frequency distributions
in these populations. Speakers of the three language
families, i.e., Indo-European (IE), Dravidian (DR), and
Austro-Asiatic (AA), irrespective of tribe or caste, are seen
to be influenced by various eigenvectors: thus the Dravidian
speakers, mostly tribals, are distributed in the upper right
quadrant, while the Orissa, West Bengal, and northeastern
tribal populations speaking AA languages cluster in the
bottom right quadrant of the plot. Many Brahmin and other
populations of northern India, speaking IE languages, are
clustered in the bottom left quadrant of the plot. The overlap
between IE speakers and DR speakers seen in the middle
of the plot can be attributed to either a confluence of two
ancestors, miscegenation, or founders to varying degrees,
or to language replacement. The populations found at the
extremes, with the highest scores along one eigenvector,
possessed the highest frequencies of one or another NRY
allele. This can be attributed to a small founder or bottleneck
effect, and uninterrupted expansion without any foreign gene
flow. The terrain and climate of eastern central India
and northeastern India favor such a population expansion.
However, the absence or low frequencies of many
other NRY clades in the AA speakers, and the concentration
of these tribal populations in huge numbers in the
eastern central India/Orissa belt and the northeastern hilly
tracts of India, suggest a concomitant origin of these NRY
clades and their language, and a spurt of huge expansion
from a small founder. This is reiterated by the observation
that all these populations possessed very little of other
parallel and later-derived NRY HGs.

Figure 74-1 Picture of 100% stacked area of NRY HG allelic
composition of 58 Indian and 5 Pakistani populations
(totaling 2447 samples) arranged according to the language
they speak at present. A clear trend of various alleles in
different language speakers was discernible. Note the relative
distributions of R1a1-M17 (yellow color), O2a-M95 (violet),
H1-M52 (green), L-M20 (pink), F*-M89 (red), and C-M130
(black) in various language speakers. P = Pakistan; IE =
Indo-European speakers; TB = Tibeto-Burmese speakers;
AA = Austro-Asiatic speakers; and DR = Dravidian speakers,
all from India. For exact population caste/tribe names and
their references, refer to Table 74-1. Refer to the color figure.

Figure 74-3 The Scree Plot of the principal component
analysis of NRY HG data of Indian populations available in
literature.

This proposition is further supported by the finding that
the Chinese and other Oriental populations carrying
derivatives of O3 are descendants of a common ancestor
from somewhere in the northeast/Myanmar region (Shi
et al., 2005). Basu et al. (2003) have suggested that the AA
and TB speaking tribal groups might have entered India
first from a northwest corridor and, much later, some
through a northeast corridor. Contrary to this, Cordaux
et al. (2004b) have proposed that northeast India acted as
a barrier. Kumar et al. (2007), in an extensive recent study,
identified a strong genetic link among sublinguistic groups
of Indian AA-speaking populations and suggested an
origin of AAs in India, who later spread to Southeast Asia.
The analysis of the present study, in light of the population
size and the extent of distribution of these AA-speaking
tribal groups, reiterates the concomitant origin of these
clades and their language. The time of origin of the clade
O2a, i.e., 11,700 years ago (Sengupta et al., 2006; Table
74-1), fits well with the assumption that spoken language
originated ~10,000 years ago. A very interesting observation
was the high frequency (50%-90%) of O2a in half
of the AA-speaking populations hitherto available in the
literature (Table 74-2).


CONCLUSION 


India, the second continent to be successfully occupied
by modern man, is heterogeneous in itself, in terms of
geography, climate, and populations. The whole of India
thus cannot be considered as a single gene pool. The
migratory history as revealed by NRY shows definitive
pathways, origin, and autochthonous expansion of various
NRY clades and populations in different parts of the
country. Many of these populations are more ancient than
the languages they speak. Thus, as various languages
developed, presumably in small founders, the population
expansion and language spread must have taken place
concomitantly: hence, we see a good correlation between
languages and NRY in India.


SUMMARY 


Modern man (Homo sapiens sapiens), originating in
Africa, first emigrated ~70,000 years ago, walked along
the coasts of India (southern coastal route model), and
reached Australia. Since then many migrations, settlements,
and expansions have taken place in various parts
of India. The island model of human settlements and
expansions may explain the origin of settled communities
and languages in India. NRY markers
help to unravel the details of early migrations of man into
South Asia. The analysis of the literature thus suggests
the origin and expansion of languages, superimposed
by the genomic data: the data imply small founders,
autochthonous origin (mutation) of new NRY markers,
nuclear origins, and uninterrupted expansion/dispersal
of populations and languages in India, as exemplified by
Austro-Asiatic (AA) speakers. Better means of communication,
and language itself, presumably led to the settlements
(founders), rapid expansion, formation of culture
and societies, and their dispersal to newer horizons and
territories. South Asia and Southeast Asia thus seem to
be the cradle of many new founders and autochthonous
civilizations commensurate with languages that are
preserved till today.


REFERENCES 


Aquadro CF, DuMont BB, Redd FA. (2001). Genome wide variation 
in human and fruitfly: a comparison. Curr Opinion Genet Dev 
11:627-634. 

Baltimore D. (2001). Our genome unveiled. Nature 409:814-816. 

Basu A, Mukherjee N, Roy S, et al. (2003). Ethnic India: A genomic 
view, with special reference to peopling and structure. Genome Res 
13:2277-2290. 

Bhasin MK. (2006). Genetics of castes and tribes of India: Indian pop- 
ulation milieu. Int J Hum Genet 6(3):233-274. 

Cann RL, Stoneking M, Wilson AC. (1987). Mitochondrial DNA and 
human evolution. Nature 325:31-36. 

Cann RL. (2001). Genetic clues to dispersal of human populations: 
Retracing the past from the present. Science 291:1742-1748. 

Capelli C, Wilson JF, Richards M, et al. (2001). A predominantly
indigenous paternal heritage for the Austronesian speaking peoples
of insular South East Asia and Oceania. Am J Hum Genet 68:432-443.

Cavalli-Sforza LL, Menozzi P, Piazza A. (1994). The history and geog- 
raphy of human genes. Princeton, New Jersey: Princeton University 
Press, p. 1088. 

Cavalli-Sforza LL. (1997). Genes, peoples, and languages. Proc Natl
Acad Sci U S A 94:7719-7724.

Cordaux R, Aunger R, Bentley G, Nasidze I, Sirajuddin SM, Stoneking 
M. (2004a). Independent origins of Indian caste and tribal paternal 
lineages. Curr Biol 14:231-235. 

Cordaux R, Weiss G, Saha N, Stoneking M. (2004b). The northeast
Indian passageway: a barrier or corridor for human migrations?
Mol Biol Evol 21:1525-1533.

Diamond J, Bellwood P. (2003). Farmers and their languages: The first 
expansions. Science 300:597-603. 

Eller E. (2001). Estimating relative population sizes from simulated 
data sets and the question of greater African effective size. Am J 
Phys Anthropol 116:1-12. 

Fuller D. (2003). An agricultural perspective on Dravidian historical 
linguistics: archaeological crop packages, livestock and Dravidian 
crop vocabulary. In Bellwood P, Renfrew C, eds. Examining The 
Farming/Language Dispersal Hypothesis. Cambridge: McDonald 
Institute for Archaeological Research, pp. 191-213. 

Giacomo Di F, Luca F, et al. (2004). Y chromosome haplogroup J as a
signature of the post-neolithic colonization of Europe. Hum Genet
115:357-371.

Hammer MF, Karafet MT, Rasanayagam A, et al. (1998). Out of Africa 
and back again: nested cladistic analysis of human Y chromosome 
variation. Mol Biol Evol 15:427-441. 

Hammer MF, Redd AJ, Wood ET, et al. (2000). Jewish and Middle
Eastern non-Jewish populations share a common pool of
Y-chromosome biallelic haplotypes. Proc Natl Acad Sci U S A
97(12):6769-6774.

HUGO Pan-Asian SNP Consortium. (2009). Mapping Human Genetic 
Diversity in Asia. Science 326(5959):1541-1545. 

Karafet TM, Xu L, Du R, et al. (2001). Paternal population history of
East Asia: Sources, patterns, and microevolutionary processes. Am
J Hum Genet 69:615-628.

Kavitha VJ. (2008). Studies on the Genomic Diversity of Southern Indian 
Breeding Isolates. PhD Thesis. Madurai Kamaraj University, India. 

Kayser M, Brauer S, Weiss G, et al. (2000). Melanesian origin of
Polynesian Y chromosomes. Curr Biol 10:1237-1246.

Kivisild T, Rootsi S, Metspalu M, et al. (2003). The genetic signatures of
earliest settlers persist in Indian tribal and caste populations. Am J
Hum Genet 72:313-332.




Kumar V, Reddy AN, Babu JP, et al. (2007). Y-chromosome evidence
suggests a common paternal heritage of Austro-Asiatic populations.
BMC Evol Biol 7:47.

Lell JT, Sukernik RI, Starikovskaya YB, et al. (2002). The dual origin
and Siberian affinities of Native American Y chromosomes. Am J
Hum Genet 70:192-198.

Misra VN. (2001). Prehistoric colonisation of India. J Biosci
26:491-531.

Nebel A, Filon D, Brinkmann B, Majumder PP, Faerman M, 
Oppenheim A. (2001). The Y chromosome pool of Jews as part 
of the genetic landscape of the Middle East. Am J Hum Genet 
69:1095-1112. 

Paddaya K. (1982). The Transition from Lower to Middle 
Paleolithic and the Origin of Modern Man. Ronen A, ed. British 
Archaeological Reports International series, Oxford, U.K.: vol 
151, pp. 257-264. 

Passarino G, Semino O, Magri C, et al. (2001). The 49a,f haplotype 11
is a new marker of the EU19 lineage that traces migrations
from northern regions of the Black Sea. Hum Immunol
62:922-932.

Passarino G, Cavalleri GL, Cavalli-Sforza LL, Borresen-Dale A-L, 
Underhill PA. (2002). Different genetic components in the 
Norwegian population revealed by the analysis of mtDNA and Y 
chromosome polymorphisms. Eur J Hum Genet 10:521-529. 

Pauling L, Itano HA, Singer SJ, Wells IC. (1949). Sickle-cell anemia,
a molecular disease. Science 110:543-548.

Petraglia MD, Haslam M, Fuller DQ, Boivin N. (2010). The southern
dispersal route and the spread of modern humans along the Indian
Ocean rim: New hypotheses and evidence. Annals of Human
Biology 37(3):288-311.

Qamar R, Ayub Q, Mohyuddin A, et al. (2002). Y chromosomal DNA 
variation in Pakistan. Am J Hum Genet 70:1107-1124. 

Quintana-Murci L, Krausz C, Zerjal T, et al. (2001). Y-chromosome
lineages trace diffusion of people and languages in Southwestern
Asia. Am J Hum Genet 68:537-542.

Rosser ZH, Zerjal T, Hurles ME, et al. (2000). Y Chromosomal Diversity
in Europe Is Clinal and Influenced Primarily by Geography, Rather
than by Language. Am J Hum Genet 67:1526-1543.

Sahoo S, Singh A, Himabindu G, et al. (2006). Prehistory of Indian 
Y chromosomes: Evaluating demic diffusion scenarios. Proc Natl 
Acad Sci U S A 103(4):843-848. 

Sanghvi LD, Balakrishnan V, Karve I. (1981). Biology of the people 
of Tamil Nadu. Pune: Indian Society of Human Genetics and 
Calcutta: Indian Anthropological Society. 

Seielstad M, Yuldasheva N, Singh N, et al. (2003). A novel Y-chromosome
variant puts an upper limit on the timing of first entry into the
Americas. Am J Hum Genet 73(3):700-705.

Semino O, Passarino G, Oefner PJ, et al. (2000). The genetic legacy of
Paleolithic Homo sapiens sapiens in extant Europeans: a Y chromosome
perspective. Science 290:1155-1159.

Semino O, Magri C, Benuzzi G, et al. (2004). Origin, Diffusion and
Differentiation of Y-chromosomal Haplogroups E and J: Inferences
on Neolithization of Europe and later migratory events in the
Mediterranean Area. Am J Hum Genet 74:1023-1034.

Sengupta S, Zhivotovsky LA, King R, et al. (2006). Polarity and
temporality of high-resolution Y chromosome distributions in India
identify both indigenous and exogenous expansions and reveal minor
genetic influence of central Asian pastoralists. Am J Hum Genet
78:202-221.

Sharma S, Rai E, Sharma P, et al. (2009). The Indian origin of paternal
haplogroup R1a1* substantiates the autochthonous origin of
Brahmins and the caste system. J Hum Genet 54(1):47-55.

Shi H, Dong YL, Wen B, et al. (2005). Y-chromosome evidence of 
southern origin of the East Asian-specific haplogroup O3-M122. 
Am J Hum Genet 77(3):408-419. 

Su B, Xiao C, Deka R, Seielstad MT, et al. (2000). Y chromosome
haplotypes reveal prehistorical migrations to the Himalayas. Hum Genet
107:582-590.




GENOMICS IN MEDICINE AND HEALTH—INDIAN SUBCONTINENT 


Thangaraj K, Singh L, Reddy AG, et al. (2003). Genetic affinities of the
Andaman Islanders, a vanishing human population. Curr Biol 13:86-93.

Thanseem I, Thangaraj K, Chaubey G, et al. (2006). Genetic affinities
among the lower castes and tribal groups of India: inference from
Y chromosome and mitochondrial DNA. BMC Genet 7:42.

Tishkoff SA, Dietzsch E, Speed W, Pakstis AJ, Kidd JR, et al. (1996).
Global patterns of linkage disequilibrium at the CD4 locus and
modern human origins. Science 271:1380-1387.

Trivedi R, Sitalaximi T, Banerjee J, Singh A, Sircar PK, Kashyap
VK. (2006). Molecular insights into the origins of the Shompen,
a declining population of the Nicobar archipelago. J Hum Genet
51:217-226.

Trivedi R, Sahoo S, Singh A, et al. (2008). Genetic imprints of
Pleistocene origin of Indian populations: A comprehensive
phylogeographic sketch of Indian Y-chromosomes. Int J Hum Genet
8(1-2):1-30.

Underhill PA, Passarino G, Lin AA, et al. (2001). The phylogeography
of Y chromosome binary haplotypes and the origins of modern
human populations. Ann Hum Genet 65:43-62.

Underhill PA. (2003). Inferring Human History: Clues from
Y-chromosome Haplotypes. Cold Spring Harbor Symposia on
Quantitative Biology. Cold Spring Harbor Laboratory Press,
Volume LXVIII, pp. 487-493.

Underhill PA, Myres NM, Rootsi S, et al. (2010). Separating the
post-Glacial coancestry of European and Asian Y chromosomes
within haplogroup R1a. Eur J Hum Genet 18(4):479-484.

Weale ME, Yepiskoposyan L, Jager RF, et al. (2001). Armenian Y
chromosome haplotypes reveal strong regional structure within a
single ethno-national group. Hum Genet 109:659-674.

Wells RS, Yuldasheva N, Ruzibakiev R, et al. (2001). The Eurasian
Heartland: A continental perspective of Y-chromosome diversity.
Proc Natl Acad Sci U S A 98:10244-10249.

Wells S. (2007). Deep Ancestry: Inside the Genographic Project.
National Geographic Society.

Wolpoff MH, Wu X, Thorne A. (1984). Modern Homo sapiens origins:
A general theory of hominid evolution involving the fossil
evidence from East Asia. In: Smith FH, Spencer F, eds. The Origins
of Modern Humans: A World Survey of the Fossil Evidence. New York:
Liss AR.

Yuldasheva N, Wells RS, Ruzibakiev R, et al. (2002). High
polymorphism at the MC1R locus is consistent with phenotype
diversity in Eurasian populations. Paper presented at The Human
Genome Organisation, Shanghai, China.

Zerjal T, Wells RS, Yuldasheva N, Ruzibakiev R, Tyler-Smith C. (2002).
A genetic landscape reshaped by recent events: Y chromosomal
insights into Central Asia. Am J Hum Genet 71:466-482.


